Computerized methods and apparatus for incremental database backup using change tracking

ABSTRACT

Computerized methods and systems are disclosed for creating an incremental backup of application data by creating a snapshot associated with a current incremental backup of a data file using a change tracking bitmap such that a data file associated with the current incremental backup can be restored from just the snapshot associated with the current incremental backup and an initial backup without needing to access one or more previously generated incremental backups of the data file.

CROSS-REFERENCE TO RELATED APPLICATIONS

This disclosure claims the benefit of priority under 35 U.S.C. § 119(e) to the following applications, the contents of which are hereby incorporated by reference in their entirety:

-   U.S. Pat. App. No. 62/063,162, filed Oct. 13, 2014, entitled COPY DATA TECHNIQUES
-   U.S. Pat. App. No. 61/905,346, filed Nov. 18, 2013, entitled COMPUTERIZED METHODS AND APPARATUS FOR INCREMENTAL DATABASE BACKUP USING CHANGE TRACKING
-   U.S. Pat. App. No. 61/905,360, filed Nov. 18, 2013, entitled DATA MANAGEMENT VIRTUALIZATION
-   U.S. Pat. App. No. 61/912,232, filed Dec. 5, 2013, entitled COMPUTERIZED METHODS AND APPARATUS FOR DATA CLONING
-   U.S. Pat. App. No. 61/905,342, filed Nov. 18, 2013, entitled TEST-AND-DEVELOPMENT WORKFLOW AUTOMATION

This disclosure is related to the following applications, the contents of which are hereby incorporated by reference in their entirety:

-   U.S. patent application Ser. No. ______, filed Nov. 18, 2014, entitled SUCCESSIVE DATA FINGERPRINTING FOR COPY ACCURACY ASSURANCE, Attorney Docket No. 2203828.00149US3
-   U.S. patent application Ser. No. ______, filed Nov. 18, 2014, entitled DATA MANAGEMENT VIRTUALIZATION, Attorney Docket No. 2203828.00154US2
-   U.S. patent application Ser. No. ______, filed Nov. 18, 2014, entitled COMPUTERIZED METHODS AND APPARATUS FOR DATA CLONING, Attorney Docket No. 2203828.00157US2
-   U.S. patent application Ser. No. ______, filed Nov. 18, 2014, entitled TEST-AND-DEVELOPMENT WORKFLOW AUTOMATION, Attorney Docket No. 2203828.00158US2

TECHNICAL FIELD

This invention relates generally to data management, data protection, and data verification.

BACKGROUND

The business requirements for managing the lifecycle of application data have been traditionally met by deploying multiple point solutions, each of which addresses a part of the lifecycle. This has resulted in a complex and expensive infrastructure where multiple copies of data are created and moved multiple times to individual storage repositories. The adoption of server virtualization has become a catalyst for simple, agile and low-cost compute infrastructure. This has led to larger deployments of virtual hosts and storage, further exacerbating the gap between the emerging compute models and the current data management implementations.

Applications that provide business services depend on storage of their data at various stages of its lifecycle. FIG. 1 shows a typical set of data management operations that would be applied to the data of an application such as a database underlying a business service such as payroll management. In order to provide a business service, application 102 requires primary data storage 122 with some contracted level of reliability and availability.

Backups 104 are made to guard against corruption of the primary data storage through hardware or software failure or human error. Typically backups may be made daily or weekly to local disk or tape 124, and moved less frequently (weekly or monthly) to a remote physically secure location 125.

Concurrent development and test 106 of new applications based on the same database requires a development team to have access to another copy of the data 126. Such a snapshot might be made weekly, depending on development schedules.

Compliance with legal or voluntary policies 108 may require that some data be retained safely for future access for some number of years; usually data is copied regularly (say, monthly) to a long-term archiving system 128.

Disaster Recovery services 110 guard against catastrophic loss of data if systems providing primary business services fail due to some physical disaster. Primary data is copied 130 to a physically distinct location as frequently as is feasible given other constraints (such as cost). In the event of a disaster the primary site can be reconstructed and data moved back from the safe copy.

Business Continuity services 112 provide a facility for ensuring continued business services should the primary site become compromised. Usually this requires a hot copy 132 of the primary data that is in near-lockstep with the primary data, as well as duplicate systems and applications and mechanisms for switching incoming requests to the Business Continuity servers.

Thus, data management is currently a collection of point applications managing the different parts of the lifecycle. This has been an artifact of evolution of data management solutions over the last two decades.

Current Data Management architecture and implementations such as described above involve multiple applications addressing different parts of data lifecycle management, all of them performing certain common functions: (a) make a copy of application data (the frequency of this action is commonly termed the Recovery Point Objective (RPO)), (b) store the copy of data in an exclusive storage repository, typically in a proprietary format, and (c) retain the copy for a certain duration, measured as Retention Time. A primary difference in each of the point solutions is in the frequency of the RPO, the Retention Time, and the characteristics of the individual storage repositories used, including capacity, cost and geographic location.

In a series of prior patent applications, e.g., U.S. Ser. No. 12/947,375, a system and method for managing data has been presented that uses Data Management Virtualization. Data Management activities, such as Backup, Replication and Archiving are virtualized in that they do not have to be configured and run individually and separately. Instead, the user defines their business requirement with regard to the lifecycle of the data, and the Data Management Virtualization System performs these operations automatically. A snapshot is taken from primary storage to secondary storage; this snapshot is then used for a backup operation to other secondary storage. Essentially an arbitrary number of these backups may be made, providing a level of data protection specified by a Service Level Agreement.

The present application provides enhancements to the above system for data management virtualization.

SUMMARY

According to some embodiments, computerized methods and systems are disclosed for creating an incremental backup of application data by creating a snapshot associated with a current incremental backup of a data file using a change tracking bitmap such that a data file associated with the current incremental backup can be restored from just the snapshot associated with the current incremental backup and an initial backup without needing to access one or more previously generated incremental backups of the data file, each created at an earlier point in time than the point in time for the current incremental backup, the method comprising: receiving, by a computing device, a data file to be monitored by the computing device; identifying, by the computing device, a prior change tracking bitmap associated with the data file, the prior change tracking bitmap comprising data indicative of changes made since a backup created at an earlier point in time than the point in time for the current incremental backup; determining, by the computing device, blocks of data of the data file changed since the prior change tracking bitmap for the prior incremental backup; transmitting, by the computing device, to a backup device blocks of data of the data file changed since the prior change tracking bitmap for the prior incremental backup; creating, by the computing device, a copy-on-write snapshot of the backup device to capture a point-in-time state of the data file, such that the data file associated with the current incremental backup can be restored from just the snapshot associated with the current incremental backup and the initial backup without needing to access one or more previously generated incremental backups of the data file, each created at an earlier point in time than the point in time for the current incremental backup.
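
By way of illustration only, the steps of the method summarized above may be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: an in-memory list of blocks stands in for both the data file and the backup device, a Python set stands in for the change tracking bitmap, and a full list copy stands in for the copy-on-write snapshot. The function name incremental_backup and all variable names are hypothetical.

```python
def incremental_backup(data_file, changed_blocks, backup_device, snapshots):
    """Send only changed blocks, then snapshot the backup device.

    data_file      -- list of block contents (stand-in for the live file)
    changed_blocks -- set of block indices flagged by the change tracking bitmap
    backup_device  -- list of blocks holding the most recent backup image
    snapshots      -- list collecting one point-in-time image per backup
    """
    sent = 0
    for index in sorted(changed_blocks):
        backup_device[index] = data_file[index]  # transmit one changed block
        sent += 1
    # Assumption: a real system would take a copy-on-write snapshot here;
    # a full copy is used only to keep the sketch short.
    snapshots.append(list(backup_device))
    changed_blocks.clear()  # begin tracking changes for the next backup
    return sent


data_file = ["a0", "b0", "c0", "d0"]
backup_device = list(data_file)      # initial (full) backup
snapshots = [list(backup_device)]    # snapshot of the initial backup

data_file[1] = "b1"
sent_first = incremental_backup(data_file, {1}, backup_device, snapshots)

data_file[3] = "d1"
sent_second = incremental_backup(data_file, {3}, backup_device, snapshots)

# The latest snapshot is a complete image of the file; the earlier
# incremental backup is not consulted when restoring.
restored = snapshots[-1]
```

In this sketch each incremental run transmits a single block, yet the most recent snapshot presents the whole file, which is the property the summary describes.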

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a simplified diagram of current methods deployed to manage the data lifecycle for a business service.

FIG. 2 is an overview of the management of data throughout its lifecycle by a single Data Management Virtualization System.

FIG. 3 is a simplified block diagram of the Data Management Virtualization system.

FIG. 4 is a view of the Data Management Virtualization Engine.

FIG. 5 illustrates the Object Management and Data Movement Engine.

FIG. 6 shows the Storage Pool Manager.

FIG. 7 shows the decomposition of the Service Level Agreement.

FIG. 8 illustrates the Application Specific Module.

FIG. 9 shows the Service Policy Manager.

FIG. 10 is a flowchart of the Service Policy Scheduler.

FIG. 11 is a block diagram of the Content Addressable Storage (CAS) provider.

FIG. 12 shows the definition of an object handle within the CAS system.

FIG. 13 shows the data model and operations for the temporal relationship graph stored for objects within the CAS.

FIG. 14 is a diagram representing the operation of a garbage collection algorithm in the CAS.

FIG. 15 is a flowchart for the operation of copying an object into the CAS.

FIG. 16 is a system diagram of a typical deployment of the Data Management Virtualization system.

FIG. 17 is a schematic diagram of a characteristic physical server device for use with the Data Management Virtualization system.

FIG. 18 is a schematic diagram showing the data model for a data fingerprint to be used in conjunction with certain embodiments of the invention.

FIG. 19 is a system architecture diagram of a deployment of the Data Management Virtualization system that incorporates data fingerprinting.

FIG. 20 is a system architecture diagram showing an application backing up a data set.

FIG. 21 is a diagram illustrating incremental copy of data files performed by a backup application during a backup process.

FIG. 22 is a diagram illustrating fingerprint verification, according to some embodiments of the present disclosure.

FIG. 23 is a flowchart illustrating a fingerprint verification process.

FIG. 24 is an exemplary diagram illustrating a traditional incremental backup.

FIG. 25 is an exemplary diagram illustrating an incremental backup using a change tracking driver, according to some embodiments.

FIGS. 26A and 26B are exemplary flow charts illustrating a computerized method for incremental backup using a change tracking driver, according to some embodiments.

FIG. 27 is an exemplary table illustrating the lifecycle of a change tracking bitmap, according to some embodiments.

FIG. 28 is an exemplary diagram illustrating a change tracking driver deployment, according to some embodiments.

FIG. 29 is an exemplary diagram illustrating a change tracking bitmap, according to some embodiments.

FIG. 30 is an exemplary flow chart illustrating a computerized method for starting change tracking for a file, according to some embodiments.

FIG. 31 is an exemplary flow chart illustrating a computerized method for terminating change tracking for a file, according to some embodiments.

FIG. 32 is an exemplary flow chart illustrating processing of file modification notifications from the system, according to some embodiments.

FIG. 33 is an exemplary flow chart illustrating a computerized method for deleting a change tracking bitmap, according to some embodiments.

FIG. 34 is an exemplary diagram illustrating a change tracking driver deployment on Hyper-V Server, according to some embodiments.

FIG. 35 is an exemplary diagram illustrating the creation process of a live clone image from a backup image of application, according to some embodiments.

FIG. 36A is an exemplary diagram illustrating the refresh process for a live clone image from a previously created backup image of an application, according to some embodiments.

FIG. 36B is an exemplary diagram illustrating a computerized method for refreshing a live clone image from a previously created backup image of an application, according to some embodiments.

FIG. 37 is an exemplary diagram illustrating the prep-mount process for a live clone image to scrub the live clone image, according to some embodiments.

FIGS. 38A and 38B are exemplary diagrams illustrating a prep-unmount operation on a live clone image that has been prep-mounted to a host, according to some embodiments.

FIG. 39A is a flow diagram of an exemplary current process to procure copy of production data for testing and developing business applications (Test-and-Development) for a business service.

FIG. 39B is a flow diagram of the new process leveraging workflow automation technology to procure copy of production data for business application development, according to some embodiments.

FIG. 39C is an exemplary diagram that illustrates the data flow for the Test-and-Development process in accordance with some embodiments.

FIG. 40 is a diagram that shows the decomposition of a workflow service, according to some embodiments.

FIG. 41 is a diagram that shows the decomposition of a workflow, which is the main abstraction modeling the underlying data flow for the test-and-development process and the basic operation unit by a workflow service, according to some embodiments.

FIG. 42 is a flowchart showing the computerized execution of a workflow by a workflow service when triggered, according to some embodiments.

FIG. 43 is a flowchart depicting the execution of a workflow item, according to some embodiments.

FIG. 44 is a diagram depicting an exemplary graphical user interface for creating a workflow, according to some embodiments.

FIG. 45 is a diagram depicting mounting of a live clone to multiple applications, according to some embodiments.

FIG. 46 is a simplified block diagram of the relationship between the NAS systems and the copy data management components, according to some embodiments.

FIG. 47 is a block diagram of the detail of the interaction of the copy data management system and the NAS Backup System, according to some embodiments.

FIG. 48 is the sequence diagram illustrating the workflow of the first time data capture of the NAS system, according to some embodiments.

FIG. 49 is the sequence diagram illustrating the workflow of a subsequent data capture of the NAS system after the first time shown in FIG. 47, according to some embodiments.

FIG. 50 is the sequence diagram describing the workflow during the recovery or access of captured data for restore, according to some embodiments.

FIG. 51 is an exemplary table that compares features of the two snapshot services, according to some embodiments.

DETAILED DESCRIPTION

This disclosure pertains to computerized methods and apparatus for incremental database backup using change tracking.

In the Data Management Virtualization system described below, a user defines business requirements with regard to the lifecycle of the data, and the Data Management Virtualization System performs these operations automatically. A snapshot is taken from primary storage to secondary storage; this snapshot is then used for a backup operation to other secondary storage. Essentially an arbitrary number of these backups may be made, providing a level of data protection specified by a Service Level Agreement.

The data management engine is operable to execute a sequence of snapshot operations to create point-in-time images of application data on a first storage pool, each successive point-in-time image corresponding to a specific, successive time-state of the application data, and each snapshot operation creating difference information indicating which application data has changed and the content of the changed application data for the corresponding time state. The data management engine is also operable to execute at least one back-up function for the application data that is scheduled for execution at non-consecutive time-states, and is also operable to maintain history information having time-state information indicating the time-state of the last back-up function performed on the application data for a corresponding back-up copy of data. The data management engine creates composite difference information from the difference information for each time-state between the time-state of the last back-up function performed on the application data and the time-state of the currently-scheduled back-up function to be performed on the application data, and sends the composite difference information to a second storage pool to be compiled with the back-up copy of data at the last time-state to create a back-up copy of data for the current time-state.
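
The composition of difference information described above may be illustrated with a short sketch. It assumes, purely for illustration, that each snapshot's difference information is a dictionary mapping a block index to the content written during that time-state, and that a back-up copy is likewise a dictionary of blocks; the function names composite_difference and apply_difference are hypothetical.

```python
def composite_difference(per_state_diffs):
    """Merge per-time-state differences into one composite difference.

    per_state_diffs -- list of dicts, oldest first, each mapping a block
                       index to the content written during that time-state.
    Later time-states override earlier ones for the same block.
    """
    composite = {}
    for diff in per_state_diffs:
        composite.update(diff)
    return composite


def apply_difference(backup_copy, difference):
    """Return a new back-up copy with the difference compiled in."""
    updated = dict(backup_copy)
    updated.update(difference)
    return updated


# Snapshots were taken at time-states t1, t2 and t3; the last back-up
# function ran at t0, so all three differences must be combined.
diffs = [{0: "A1"}, {2: "C2"}, {0: "A3", 3: "D3"}]
last_backup = {0: "A0", 1: "B0", 2: "C0", 3: "D0"}

composite = composite_difference(diffs)
current_backup = apply_difference(last_backup, composite)
```

Block 0 changed twice between back-ups, but only its latest content appears in the composite, so the second storage pool receives each changed block once.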

Data Management Virtualization technology according to this disclosure uses an architecture and implementation based on the following guiding principles.

First, define the business requirements of an application with a Service Level Agreement (SLA) for its entire data lifecycle. The SLA is more than a single RPO, Retention and Recovery Time Objective (RTO). It describes the data protection characteristics for each stage of the data lifecycle. Each application may have a different SLA.

Second, provide a unified Data Management Virtualization Engine that manages the data protection lifecycle, moving data across the various storage repositories, with improved storage capacity and network bandwidth. The Data Management Virtualization system achieves these improvements by leveraging extended capabilities of modern storage systems by tracking the portions of the data that have changed over time and by data deduplication and compression algorithms that reduce the amount of data that needs to be copied and moved.

Third, leverage a single master copy of the application data to be the basis for multiple elements within the lifecycle. Many of the Data Management operations such as backup, archival and replication depend on a stable, consistent copy of the data to be protected. The Data Management Virtualization System leverages a single copy of the data for multiple purposes. A single instance of the data maintained by the system may serve as the source, from which each data management function may make additional copies as needed. This contrasts with requiring application data to be copied multiple times by multiple independent data management applications in the traditional approach.

Fourth, abstract physical storage resources into a series of data protection storage pools, which are virtualized out of different classes of storage including local and remote disk, solid state memory, tape and optical media, private, public and/or hybrid storage clouds. The storage pools provide access independent of the type, physical location or underlying storage technology. Business requirements for the lifecycle of data may call for copying the data to different types of storage media at different times. The Data Management Virtualization system allows the user to classify and aggregate different storage media into storage pools, for example, a Quick Recovery Pool, which may include high speed disks, and a Cost Efficient Long-term Storage Pool, which may be a deduplicated store on high capacity disks, or a tape library. The Data Management Virtualization System can move data amongst these pools to take advantage of the unique characteristics of each storage medium. The abstraction of Storage Pools provides access independent of the type, physical location or underlying storage technology.

Fifth, improve the movement of the data between storage pools and disaster locations utilizing underlying device capabilities and post-deduplicated application data. The Data Management Virtualization System discovers the capabilities of the storage systems that include the Storage Pools, and takes advantage of these capabilities to move data efficiently. If the Storage System is a disk array that supports the capability of creating a snapshot or clone of a data volume, the Data Management Virtualization System will take advantage of this capability and use a snapshot to make a copy of the data rather than reading the data from one place and writing it to another. Similarly, if a storage system supports change tracking, the Data Management Virtualization System will update an older copy with just the changes to efficiently create a new copy. When moving data across a network, the Data Management Virtualization system uses a deduplication and compression algorithm that avoids sending data that is already available on the other side of the network.
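
The idea of not sending data that is already available on the other side of the network may be sketched as follows. The disclosure does not specify the deduplication algorithm at this point, so this sketch assumes a simple content-hash comparison using SHA-256 from the Python standard library, and it omits compression; the function name blocks_to_send is hypothetical.

```python
import hashlib


def blocks_to_send(local_blocks, remote_hashes):
    """Return (digest, block) pairs for content the remote side lacks.

    local_blocks  -- iterable of bytes objects to be replicated
    remote_hashes -- set of hex digests already stored remotely
    """
    to_send = []
    seen = set(remote_hashes)
    for block in local_blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in seen:
            to_send.append((digest, block))
            seen.add(digest)  # identical content is sent at most once
    return to_send


# The remote side already holds the block b"alpha".
remote = {hashlib.sha256(b"alpha").hexdigest()}
payload = blocks_to_send([b"alpha", b"beta", b"beta", b"gamma"], remote)
sent_contents = [block for _, block in payload]
```

Of the four local blocks, only two distinct blocks unknown to the remote side are placed in the payload.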

One key aspect of improving data movement is recognizing that application data changes slowly over time. A copy of an application that is made today will, in general, have a lot of similarities to the copy of the same application that was made yesterday. In fact today's copy of the data could be represented as yesterday's copy with a series of delta transformations, where the size of the delta transformations themselves is usually much smaller than all of the data in the copy itself. The Data Management Virtualization system captures and records these transformations in the form of bitmaps or extent lists. In one embodiment of the system, the underlying storage resources (a disk array or server virtualization system) are capable of tracking the changes made to a volume or file; in these environments, the Data Management Virtualization system queries the storage resources to obtain these change lists, and saves them with the data being protected.
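
The two change representations named above, bitmaps and extent lists, carry the same information in different forms. The following sketch converts one into the other, assuming for illustration that a bitmap is a list of 0/1 flags with one flag per block and that an extent is a (start block, length) pair; the function name bitmap_to_extents is hypothetical.

```python
def bitmap_to_extents(bitmap):
    """Convert a list of 0/1 flags into (start_block, length) extents."""
    extents = []
    start = None  # index where the current run of changed blocks began
    for index, bit in enumerate(bitmap):
        if bit and start is None:
            start = index
        elif not bit and start is not None:
            extents.append((start, index - start))
            start = None
    if start is not None:  # close a run that reaches the end of the bitmap
        extents.append((start, len(bitmap) - start))
    return extents


bitmap = [0, 1, 1, 1, 0, 0, 1, 0, 1, 1]
extents = bitmap_to_extents(bitmap)
```

An extent list tends to be more compact when changes are clustered, while a bitmap has a fixed size regardless of how the changes are distributed.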

In the preferred embodiment of the Data Management Virtualization system, there is a mechanism for eavesdropping on the primary data access path of the application, which enables the Data Management Virtualization system to observe which parts of the application data are modified, and to generate its own bitmap of modified data. If, for example, the application modifies blocks 100, 200 and 300 during a particular period, the Data Management Virtualization system will eavesdrop on these events, and create a bitmap that indicates that these particular blocks were modified. When processing the next copy of application data, the Data Management Virtualization system will only process blocks 100, 200 and 300 since it knows that these were the only blocks that were modified.
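
The example of blocks 100, 200 and 300 may be sketched as a small tracker that is notified of each write. This is an illustrative sketch only: the class ChangeTracker and its methods are hypothetical, the bitmap is held in a Python bytearray with one bit per block, and the mechanism by which writes are intercepted is outside the sketch.

```python
class ChangeTracker:
    """Records which blocks were written, one bit per block."""

    def __init__(self, block_count):
        self.block_count = block_count
        self.bits = bytearray((block_count + 7) // 8)

    def on_write(self, block):
        """Called for every observed write on the data access path."""
        self.bits[block // 8] |= 1 << (block % 8)

    def modified_blocks(self):
        """Return the indices of all blocks flagged as modified."""
        return [i for i in range(self.block_count)
                if self.bits[i // 8] & (1 << (i % 8))]

    def reset(self):
        """Clear the bitmap once a copy has been processed."""
        self.bits = bytearray(len(self.bits))


tracker = ChangeTracker(block_count=1024)
for block in (100, 200, 300, 200):   # block 200 is written twice
    tracker.on_write(block)
to_process = tracker.modified_blocks()
```

Repeated writes to the same block set the same bit, so the next copy operation processes each modified block once.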

In one embodiment of the system, where the primary storage for the application is a modern disk array or storage virtualization appliance, the Data Management Virtualization system takes advantage of a point-in-time snapshot capability of an underlying storage device to make the initial copy of the data. This virtual copy mechanism is a fast, efficient and low-impact technique of creating the initial copy that does not guarantee that all the bits will be copied, or stored together. Instead, virtual copies are constructed by maintaining metadata and data structures, such as copy-on-write volume bitmaps or extents, that allow the copies to be reconstructed at access time. The copy has a lightweight impact on the application and on the primary storage device. In another embodiment, where the application is based on a Server Virtualization System such as VMware or Xen, the Data Management Virtualization system uses the similar virtual-machine-snapshot capability that is built into the Server Virtualization systems. When a virtual copy capability is not available, the Data Management Virtualization System may include its own built-in snapshot mechanism.
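
A copy-on-write virtual copy of the kind described above may be sketched as follows. The sketch is a simplified illustration rather than any particular storage device's mechanism: the class Volume and its methods are hypothetical, and a snapshot is represented as a dictionary that holds only the blocks overwritten since the snapshot was taken.

```python
class Volume:
    """A block volume supporting simple copy-on-write snapshots."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshots = []

    def snapshot(self):
        """Create a snapshot; no data is copied at creation time."""
        preserved = {}  # block index -> content at snapshot time
        self.snapshots.append(preserved)
        return preserved

    def write(self, index, data):
        """Preserve the old block for each snapshot, then overwrite."""
        for preserved in self.snapshots:
            if index not in preserved:
                preserved[index] = self.blocks[index]
        self.blocks[index] = data

    def read_snapshot(self, preserved, index):
        """Reconstruct a snapshot block at access time."""
        if index in preserved:
            return preserved[index]
        return self.blocks[index]  # unchanged since the snapshot


volume = Volume(["a", "b", "c"])
snap = volume.snapshot()
volume.write(1, "B")
volume.write(1, "BB")
image = [volume.read_snapshot(snap, i) for i in range(3)]
```

Only the one overwritten block is stored for the snapshot; the remaining blocks are shared with the live volume and read from it on demand.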

It is possible to use the snapshot as a data primitive underlying all of the data management functions supported by the system. Because it is lightweight, the snapshot can be used as an internal operation even when the requested operation is not a snapshot per se; it is created to enable and facilitate other operations.

At the time of creation of a snapshot, there may be certain preparatory operations involved in order to create a coherent snapshot or coherent image, such that the image may be restored to a state that is usable by the application. These preparatory operations need only be performed once, even if the snapshot will be leveraged across multiple data management functions in the system, such as backup copies which are scheduled according to a policy. The preparatory operations may include application quiescence, which includes flushing data caches and freezing the state of the application; it may also include other operations known in the art and other operations useful for retaining a complete image, such as collecting metadata information from the application to be stored with the image.

FIG. 2 illustrates one way that a Virtualized Data Management system can address the data lifecycle requirements described earlier in accordance with these principles.

To serve local backup requirements, a sequence of efficient snapshots is made within local high-availability storage 202. Some of these snapshots are used to serve development/test requirements without making another copy. For longer term retention of local backup, a copy is made efficiently into long-term local storage 204, which in this implementation uses deduplication to reduce repeated copying. The copies within long-term storage may be accessed as backups or treated as an archive, depending on the retention policy applied by the SLA. A copy of the data is made to remote storage 206 in order to satisfy requirements for remote backup and business continuity; again, a single set of copies suffices for both purposes. As an alternative for remote backup and disaster recovery, a further copy of the data may be made efficiently to a repository 208 hosted by a commercial or private cloud storage provider.

The Data Management Virtualization System

FIG. 3 illustrates the high level components of the Data Management Virtualization System that implements the above principles. Preferably, the system includes these basic functional components further described below.

Application 300 creates and owns the data. This is the software system that has been deployed by the user, for example, an email system, a database system, or financial reporting system, in order to satisfy some computational need. The Application typically runs on a server and utilizes storage. For illustrative purposes, only one application has been indicated. In reality there may be hundreds or even thousands of applications that are managed by a single Data Management Virtualization System.

Storage Resources 302 is where application data is stored through its lifecycle. The Storage Resources are the physical storage assets, including internal disk drives, disk arrays, optical and tape storage libraries and cloud-based storage systems that the user has acquired to address data storage requirements. The storage resources include Primary Storage 310, where the online, active copy of the application data is stored, and Secondary Storage 312 where additional copies of the application data are stored for purposes such as backup, disaster recovery, archiving, indexing, reporting and other uses. Secondary storage resources may include additional storage within the same enclosure as the primary storage, as well as storage based on similar or different storage technologies within the same data center, another location or across the internet.

One or more Management Workstations 308 allow the user to specify a Service Level Agreement (SLA) 304 that defines the lifecycle for the application data. A Management Workstation is a desktop or laptop computer or a mobile computing device that is used to configure, monitor and control the Data Management Virtualization System. A Service Level Agreement is a detailed specification that captures the detailed business requirements related to the creation, retention and deletion of secondary copies of the application data. The SLA is more than the simple RTO and RPO that are used in traditional data management applications to represent the frequency of copies and the anticipated restore time for a single class of secondary storage. The SLA captures the multiple stages in the data lifecycle specification, and allows for non-uniform frequency and retention specifications within each class of secondary storage. The SLA is described in greater detail in connection with FIG. 7.
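
One way to picture an SLA that spans multiple lifecycle stages with non-uniform frequency and retention is as a small data structure. The following is a hypothetical sketch only: the class names Policy and ServiceLevelAgreement, the field names, the pool names and the intervals are all assumptions made for illustration and are not taken from the SLA decomposition of FIG. 7.

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class Policy:
    """One stage of the lifecycle: copy from a source to a target pool."""
    source_pool: str
    target_pool: str
    frequency: timedelta   # how often a copy is made
    retention: timedelta   # how long each copy is kept


@dataclass
class ServiceLevelAgreement:
    """Lifecycle specification for the data of one application."""
    application: str
    policies: list = field(default_factory=list)

    def policies_for(self, target_pool):
        """Return every policy that writes into the given pool."""
        return [p for p in self.policies if p.target_pool == target_pool]


sla = ServiceLevelAgreement(
    application="payroll-db",
    policies=[
        # Two policies on the same pool: non-uniform frequency and retention.
        Policy("primary", "snapshot_pool", timedelta(hours=1), timedelta(days=1)),
        Policy("primary", "snapshot_pool", timedelta(days=1), timedelta(days=7)),
        Policy("snapshot_pool", "dedup_pool", timedelta(days=7), timedelta(days=365)),
    ],
)
snapshot_policies = sla.policies_for("snapshot_pool")
```

Here hourly copies are kept for a day while daily copies in the same pool are kept for a week, which a single RPO and retention value could not express.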

Data Management Virtualization Engine 306 manages all of the lifecycle of the application data as specified in the SLA. It manages potentially a large number of SLAs for a large number of applications. The Data Management Virtualization Engine takes inputs from the user through the Management Workstation and interacts with the applications to discover the applications' primary storage resources. The Data Management Virtualization Engine makes decisions regarding what data needs to be protected and what secondary storage resources best fulfill the protection needs. For example, if an enterprise designates its accounting data as requiring copies to be made at very short intervals for business continuity purposes as well as for backup purposes, the Engine may decide to create copies of the accounting data at a short interval to a first storage pool, and to also create backup copies of the accounting data to a second storage pool at a longer interval, according to an appropriate set of SLAs. This is determined by the business requirements of the storage application.

The Engine then makes copies of application data using advanced capabilities of the storage resources as available. In the above example, the Engine may schedule the short-interval business continuity copy using a storage appliance's built-in virtual copy or snapshot capabilities. The Data Management Virtualization Engine moves the application data amongst the storage resources in order to satisfy the business requirements that are captured in the SLA. The Data Management Virtualization Engine is described in greater detail in connection with FIG. 4.

The Data Management Virtualization System as a whole may be deployed within a single host computer system or appliance, or it may be one logical entity but physically distributed across a network of general-purpose and purpose-built systems. Certain components of the system may also be deployed within a computing or storage cloud.

In one embodiment of the Data Management Virtualization System, the Data Management Virtualization Engine largely runs as multiple processes on a fault tolerant, redundant pair of computers. Certain components of the Data Management Virtualization Engine may run close to the application within the application servers. Some other components may run close to the primary and secondary storage, within the storage fabric or in the storage systems themselves. The Management Workstations are typically desktop and laptop computers and mobile devices that connect over a secure network to the Engine.

The Data Management Virtualization Engine

FIG. 4 illustrates an architectural overview of the Data Management Virtualization Engine 306 according to certain embodiments of the invention. The Engine 306 includes the following modules:

Application Specific Module 402. This module is responsible for controlling and collecting metadata from the application 300. Application metadata includes information about the application such as the type of application, details about its configuration, location of its datastores, and its current operating state. Controlling the operation of the application includes actions such as flushing cached data to disk, freezing and thawing application I/O, rotating or truncating log files, and shutting down and restarting applications. The Application Specific Module performs these operations and sends and receives metadata in response to commands from the Service Level Policy Engine 406, described below. The Application Specific Module is described in more detail in connection with FIG. 8.

Service Level Policy Engine 406. This module acts on the SLA 304 provided by the user to make decisions regarding the creation, movement and deletion of copies of the application data. Each SLA describes the business requirements related to protection of one application. The Service Level Policy Engine analyzes each SLA and arrives at a series of actions, each of which involves the copying of application data from one storage location to another. The Service Level Policy Engine then reviews these actions to determine priorities and dependencies, and schedules and initiates the data movement jobs. The Service Level Policy Engine is described in more detail in connection with FIG. 9.

Object Manager and Data Movement Engine 410. This module creates a composite object consisting of the Application data, the Application Metadata and the SLA, which it moves through different storage pools per instruction from the Policy Engine. The Object Manager receives instructions from the Service Level Policy Engine 406 in the form of a command to create a copy of application data in a particular pool based on the live primary data 413 belonging to the application 300, or from an existing copy, e.g., 415, in another pool. The copy of the composite object that is created by the Object Manager and the Data Movement Engine is self-contained and self-describing in that it contains not only application data, but also application metadata and the SLA for the application. The Object Manager and Data Movement Engine are described in more detail in connection with FIG. 5.

Storage Pool Manager 412. This module is a component that adapts and abstracts the underlying physical storage resources 302 and presents them as virtual storage pools 418. The physical storage resources are the actual storage assets, such as disk arrays and tape libraries, that the user has deployed for the purpose of supporting the lifecycle of the data of the user's applications. These storage resources might be based on different storage technologies such as disk, tape, flash memory or optical storage. The storage resources may also have different geographic locations, cost and speed attributes, and may support different protocols. The role of the Storage Pool Manager is to combine and aggregate the storage resources, and mask the differences between their programming interfaces. The Storage Pool Manager presents the physical storage resources to the Object Manager 410 as a set of storage pools that have characteristics that make these pools suitable for particular stages in the lifecycle of application data. The Storage Pool Manager is described in more detail in connection with FIG. 6.

Object Manager and Data Movement Engine

FIG. 5 illustrates the Object Manager and Data Movement Engine 410. The Object Manager and Data Movement Engine discovers and uses Virtual Storage Resources 510 presented to it by the Pool Managers 504. It accepts requests from the Service Level Policy Engine 406 to create and maintain Data Storage Object instances from the resources in a Virtual Storage Pool, and it copies application data among instances of storage objects from the Virtual Storage Pools according to the instructions from the Service Level Policy Engine. The target pool selected for the copy implicitly designates the business operation being selected, e.g. backup, replication or restore. The Service Level Policy Engine resides either locally to the Object Manager (on the same system) or remotely, and communicates using a protocol over standard networking communication. TCP/IP may be used in a preferred embodiment, as it is well understood, widely available, and allows the Service Level Policy Engine to be located locally to the Object Manager or remotely with little modification.

In one embodiment, the system may deploy the Service Level Policy Engine on the same computer system as the Object Manager for ease of implementation. In another embodiment, the system may employ multiple systems, each hosting a subset of the components if beneficial or convenient for an application, without changing the design.

The Object Manager 501 and the Storage Pool Managers 504 are software components that may reside on the computer system platform that interconnects the storage resources and the computer systems that use those storage resources, where the user's application resides. The placement of these software components on the interconnect platform is designated as a preferred embodiment, and may provide the ability to connect customer systems to storage via communication protocols widely used for such applications (e.g. Fibre Channel, iSCSI, etc.), and may also provide ease of deployment of the various software components.

The Object Manager 501 and Storage Pool Manager 504 communicate with the underlying storage virtualization platform via the Application Programming Interfaces made available by the platform. These interfaces allow the software components to query and control the behavior of the computer system and how it interconnects the storage resources and the computer system where the user's Application resides. The components apply modularity techniques as is common within the practice to allow replacement of the intercommunication code particular to a given platform.

The Object Manager and Storage Pool Managers communicate via a protocol. The protocol messages are transmitted over standard networking protocols, e.g. TCP/IP, or standard Interprocess Communication (IPC) mechanisms typically available on the computer system. This allows comparable communication between the components whether they reside on the same computer platform or on multiple computer platforms connected by a network, depending on the particular computer platform. The current configuration has all of the local software components residing on the same computer system for ease of deployment. This is not a strict requirement of the design, as described above, and can be reconfigured in the future as needed.

Object Manager

Object Manager 501 is a software component for maintaining Data Storage Objects, and provides a set of protocol operations to control it. The operations include creation, destruction, duplication, and copying of data among the objects, maintaining access to objects, and in particular allow the specification of the storage pool used to create copies. There is no common subset of functions supported by all pools; however, in a preferred embodiment, primary pools may be performance-optimized, i.e. lower latency, whereas backup or replication pools may be capacity-optimized, supporting larger quantities of data, and content-addressable. The pools may be remote or local. The storage pools are classified according to various criteria, including means by which a user may make a business decision, e.g. cost per gigabyte of storage.

First, the particular storage device from which the storage is drawn may be a consideration, as equipment is allocated for different business purposes, along with associated cost and other practical considerations. Some devices may not even be actual hardware but capacity provided as a service, and selection of such a resource can be done for practical business purposes.

Second, the network topological “proximity” is considered, as near storage is typically connected by low-latency, inexpensive network resources, while distant storage may be connected by high-latency, bandwidth-limited, expensive network resources; conversely, the distance of a storage pool relative to the source may be beneficial when geographic diversity protects against a physical disaster affecting local resources.

Third, storage optimization characteristics are considered, where some storage is optimized for space-efficient storage, but requires computation time and resources to analyze or transform the data before it can be stored, while other storage by comparison is “performance optimized,” taking more storage resources by comparison but using comparatively little computation time or resource to transform the data, if at all.

Fourth, “speed of access” characteristics are considered, where some resources intrinsic to a storage computer platform are readily and quickly made available to the user's Application, e.g. as a virtual SCSI block device, while some can only be indirectly used. The ease and speed of recovery is often governed by the kind of storage used, and this allows it to be suitably classified.

Fifth, the amount of storage used and the amount available in a given pool are considered, as there may be benefit to either concentrating or spreading the storage capacity used.

The Service Level Policy Engine, described below, combines the SLA provided by the user with the classification criteria to determine how and when to maintain the application data, and from which storage pools to draw the needed resources to meet the Service Level Agreement (SLA).

The object manager 501 creates, maintains and employs a history mechanism to track the series of operations performed on a data object within the performance pools, and to correlate those operations with others that move the object to other storage pools, in particular capacity-optimized ones. This series of records for each data object is maintained at the object manager for all data objects in the primary pool, initially correlated by primary data object, then correlated by operation order: a time line for each object and a list of all such time lines. Each operation performed exploits underlying virtualization primitives to capture the state of the data object at a given point in time.

Additionally, the underlying storage virtualization appliance may be modified to expose and allow retrieval of internal data structures, such as bitmaps, that indicate the modification of portions of the data within the data object. These data structures are exploited to capture the state of a data object at a point in time (e.g., a snapshot of the data object) and to provide differences between snapshots taken at specific times, thereby enabling optimal backup and restore. While the particular implementations and data structures may vary among different appliances from different vendors, a data structure is employed to track changes to the data object, and storage is employed to retain the original state of those portions of the object that have changed: indications in the data structure correspond to data retained in the storage. When accessing the snapshot, the data structure is consulted and, for portions that have been changed, the preserved data is accessed rather than the current data, as the data object has been modified at the areas so indicated. A typical data structure employed is a bitmap, where each bit corresponds to a section of the data object. Setting the bit indicates that section has been modified after the point in time of the snapshot operation. The underlying snapshot primitive mechanism maintains this for as long as the snapshot object exists.
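By way of illustration only, the copy-on-write tracking described above might be sketched as follows. The class Volume and its methods are hypothetical names introduced for this example; they do not correspond to the interface of any particular storage virtualization appliance, and the sketch assumes a single snapshot per volume.

```python
class Volume:
    """Toy block volume with one copy-on-write snapshot (hypothetical)."""

    def __init__(self, blocks):
        self.blocks = list(blocks)   # current data, one entry per section
        self.changed = None          # bitmap: one flag per section
        self.preserved = {}          # section index -> original content

    def take_snapshot(self):
        # Start tracking: no section has been modified yet.
        self.changed = [False] * len(self.blocks)
        self.preserved = {}

    def write(self, index, data):
        # On the first write after the snapshot, retain the original
        # content and set the corresponding bit.
        if self.changed is not None and not self.changed[index]:
            self.preserved[index] = self.blocks[index]
            self.changed[index] = True
        self.blocks[index] = data

    def read_snapshot(self, index):
        # Consult the bitmap: changed sections come from preserved
        # storage, unchanged sections from the current data.
        if self.changed is not None and self.changed[index]:
            return self.preserved[index]
        return self.blocks[index]
```

In this sketch the bitmap alone says where the object changed, while the preserved storage holds what the object contained at the time of the snapshot.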

The time line described above maintains a list of the snapshot operations against a given primary data object, including the time an operation is started, the time it is stopped (if at all), a reference to the snapshot object, and a reference to the internal data structure (e.g. bitmaps or extent lists), so that it can be obtained from the underlying system. Also maintained is a reference to the result of copying the state of the data object at any given point in time into another pool. As an example, copying the state of a data object into a capacity-optimized pool using content addressing results in an object handle. That object handle corresponds to a given snapshot and is stored with the snapshot operation in the time line. This correlation is used to identify suitable starting points.

Optimal backup and restore consult the list of operations from a desired starting point to an end point. A time-ordered list of operations and their corresponding data structures (bitmaps) is constructed such that a continuous time series from start to finish is realized: there is no gap between start times of the operations in the series. This ensures that all changes to the data object are represented by the corresponding bitmap data structures. It is not necessary to retrieve all operations from start to finish; simultaneously existing data objects and underlying snapshots overlap in time; it is only necessary that there are no gaps in time where a change might have occurred that was not tracked. As bitmaps indicate that a certain block of storage has changed but not what the change is, the bitmaps may be added or composed together to realize a set of all changes that occurred in the time interval. Instead of using this data structure to access the state at a point in time, the system exploits the fact that the data structure represents data modified as time marches forward: the end state of the data object is accessed at the indicated areas, thus returning the set of changes to the given data object from the given start time to the end time.
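The composition of bitmaps described above can be illustrated with a minimal sketch. The names SnapshotOp, covers_without_gaps, compose and changed_blocks are hypothetical, and modelling each operation as a closed time interval with integer timestamps is an assumption made only for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SnapshotOp:
    start: int           # time at which change tracking began
    stop: int            # time at which change tracking ended
    bitmap: List[bool]   # one flag per section of the data object


def covers_without_gaps(ops: List[SnapshotOp], start: int, end: int) -> bool:
    """True if the operations track every instant from start to end."""
    covered_until = start
    for op in sorted(ops, key=lambda o: o.start):
        if op.start > covered_until:
            return False          # an untracked gap exists
        covered_until = max(covered_until, op.stop)
    return covered_until >= end


def compose(ops: List[SnapshotOp]) -> List[bool]:
    """Add the bitmaps together (bitwise OR) to get all changed sections."""
    combined = [False] * len(ops[0].bitmap)
    for op in ops:
        combined = [a or b for a, b in zip(combined, op.bitmap)]
    return combined


def changed_blocks(end_state: List[str], combined: List[bool]) -> Dict[int, str]:
    """Read the END state of the object at each indicated section."""
    return {i: end_state[i] for i, bit in enumerate(combined) if bit}
```

Because the composed bitmap only records which sections changed, the data itself is always taken from the end state, which yields the net set of changes over the whole interval.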

The backup operation exploits this time line, the correlated references, and access to the internal data structures to realize our backup operation. Similarly, it uses the system in a complementary fashion to accomplish our restore operation. The specific steps are described below in the section for “Optimal Backup/Restore.”

Virtual Storage Pool Types

FIG. 5 illustrates several representative storage pool types. Although one primary storage pool and two secondary storage pools are depicted in the figure, many more may be configured in some embodiments.

Primary Storage Pool 507—contains the storage resources used to create the data objects in which the user Application stores its data. This is in contrast to the other storage pools, which exist primarily to fulfill the operation of the Data Management Virtualization Engine.

Performance Optimized Pool 508—a virtual storage pool able to provide high performance backup (i.e. point in time duplication, described below) as well as rapid access to the backup image by the user Application.

Capacity Optimized Pool 509—a virtual storage pool that chiefly provides storage of a data object in a highly space-efficient manner by use of deduplication techniques described below. The virtual storage pool provides access to the copy of the data object, but does not do so with high performance as its chief aim, in contrast to the Performance Optimized pool above.

The initial deployments contain storage pools as described above, as a minimal operational set. The design fully expects multiple Pools of a variety of types, representing various combinations of the criteria illustrated above, and multiple Pool Managers as is convenient to represent all of the storage in future deployments. The tradeoffs illustrated above are typical of computer data storage systems.

From a practical point of view, these three pools represent a preferred embodiment, addressing most users' requirements in a very simple way. Most users will find that if they have one pool of storage for urgent restore needs, which affords quick recovery, and one other pool that is low cost, so that a large number of images can be retained for a long period of time, almost all of the business requirements for data protection can be met with little compromise.

The format of data in each pool is dictated by the objectives and technology used within the pool. For example, the quick recovery pool is maintained in a form very similar to the original data to minimize the translation required and to improve the speed of recovery. The long-term storage pool, on the other hand, uses deduplication and compression to reduce the size of the data and thus reduce the cost of storage.

Object Management Operations 505

The Object Manager 501 creates and maintains instances of Data Storage Objects 503 from the Virtual Storage Pools 418 according to the instructions sent to it by the Service Level Policy Engine 406. The Object Manager provides data object operations in five major areas: point-in-time duplication or copying (commonly referred to as “snapshots”), standard copying, object maintenance, mapping and access maintenance, and collections.

Object Management operations also include a series of Resource Discovery operations for maintaining Virtual Storage Pools themselves and retrieving information about them. The Pool Manager 504 ultimately supplies the functionality for these.

Point-in-Time Copy (“Snapshot”) Operations

Snapshot operations create a data object instance representing an initial object instance at a specific point in time. More specifically, a snapshot operation creates a complete virtual copy of the members of a collection using the resources of a specified Virtual Storage Pool. This is called a Data Storage Object. Multiple states of a Data Storage Object are maintained over time, such that the state of a Data Storage Object as it existed at a point in time is available. As described above, a virtual copy is a copy implemented using an underlying storage virtualization API that allows a copy to be created in a lightweight fashion, using copy-on-write or other in-band technologies instead of copying and storing all bits of duplicate data to disk. This may be implemented using software modules written to access the capabilities of an off-the-shelf underlying storage virtualization system such as provided by EMC, VMware or IBM in some embodiments. Where such underlying virtualizations are not available, the described system may provide its own virtualization layer for interfacing with unintelligent hardware.

Snapshot operations require the application to freeze the state of the data to a specific point so that the image data is coherent, and so that the snapshot may later be used to restore the state of the application at the time of the snapshot. Other preparatory steps may also be required. These are handled by the Application-Specific Module 402, which is described in a subsequent section. For live applications, therefore, the most lightweight operations are desired.

Snapshot operations are used as the data primitive for all higher-level operations in the system. In effect, they provide access to the state of the data at a particular point in time. As well, since snapshots are typically implemented using copy-on-write techniques that distinguish what has changed from what is resident on disk, these snapshots provide differences that can also be composed or added together to efficiently copy data throughout the system. The format of the snapshot may be the format of data that is copied by Data Mover 502, which is described below.

Standard Copy Operations

When a copy operation is not a snapshot, it may be considered a standard copy operation. A standard copy operation copies all or a subset of a source data object in one storage pool to a data object in another storage pool. The result is two distinct objects. One type of standard copy operation that may be used is an initial “baseline” copy. This is typically done when data is initially copied from one Virtual Storage Pool into another, such as from a performance-optimized pool to a capacity-optimized storage pool. Another type of standard copy operation may be used wherein only changed data or differences are copied to a target storage pool to update the target object. This would occur after an initial baseline copy has previously been performed.

A complete exhaustive version of an object need not be preserved in the system each time a copy is made, even though a baseline copy is needed when the Data Virtualization System is first initialized. This is because each virtual copy provides access to a complete copy. Any delta or difference can be expressed in relation to a virtual copy instead of in relation to a baseline. This has the positive side effect of virtually eliminating the common step of walking through a series of change lists.

Standard copy operations are initiated by a series of instructions or requests supplied by the Pool Manager and received by the Data Mover to cause the movement of data among the Data Storage Objects, and to maintain the Data Storage Objects themselves. The copy operations allow the creation of copies of the specified Data Storage Objects using the resources of a specified Virtual Storage Pool. The result is a copy of the source Data Object in a target Data Object in the storage pool.

The Snapshot and Copy operations are each structured with a preparation operation and an activation operation. The two steps of prepare and activate allow the long-running resource allocation operations, typical of the prepare phase, to be decoupled from the actuation. This is required by applications that can only be paused for a short while to fulfill the point-in-time characteristics of a snapshot operation, which in reality takes a finite but non-zero amount of time to accomplish. Similarly for copy and snapshot operations, this two-step preparation and activation structure allows the Policy Engine to proceed with an operation only if resources for all of the collection members can be allocated.
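The two-step structure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the classes Pool and SnapshotRequest are hypothetical, and resource allocation is reduced to reserving a count of free blocks.

```python
class Pool:
    """Hypothetical storage pool with a simple count of free blocks."""

    def __init__(self, free_blocks):
        self.free_blocks = free_blocks


class SnapshotRequest:
    """Hypothetical two-phase request over the members of a collection."""

    def __init__(self, pool, members):
        self.pool = pool
        self.members = members      # member name -> size in blocks
        self.prepared = False

    def prepare(self):
        # Long-running phase: allocate resources for ALL members, or none.
        needed = sum(self.members.values())
        if needed > self.pool.free_blocks:
            return False
        self.pool.free_blocks -= needed
        self.prepared = True
        return True

    def activate(self):
        # Short phase: performed while the application is briefly paused.
        if not self.prepared:
            raise RuntimeError("activate called before a successful prepare")
        return sorted(self.members)
```

Separating the phases keeps the application pause limited to the activation step, and a failed prepare leaves the pool untouched so the operation is simply not started.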

Object Maintenance

Object Maintenance operations are a series of operations for maintaining data objects, including creation, destruction, and duplication. The Object Manager and Data Mover use functionality provided by a Pool Request Broker (more below) to implement these operations. The data objects may be maintained at a global level, at each Storage Pool, or preferably both.

Collections

Collection operations are auxiliary functions. Collections are abstract software concepts, lists maintained in memory by the object manager. They allow the Policy Engine 206 to request a series of operations over all of the members in a collection, allowing a consistent application of a request to all members. The use of collections allows for simultaneous activation of the point-in-time snapshot so that multiple Data Storage Objects are all captured at precisely the same point in time, as this is typically required by the application for a logically correct restore. The use of collections allows for convenient request of a copy operation across all members of a collection, where an application would use multiple storage objects as a logical whole.

Resource Discovery Operations

The Object Manager discovers Virtual Storage Pools by issuing Object Management Operations 505 to the Pool Manager 504, and uses the information obtained about each of the pools to select one that meets the required criteria for a given request; in the case where none match, a default pool is selected. The Object Manager can then create a data storage object using resources from the selected Virtual Storage Pool.
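As a rough illustration of the selection step described above, the following sketch picks the first discovered pool that satisfies the request and otherwise falls back to a default. The PoolInfo fields and the select_pool function are hypothetical; the actual criteria are whatever the request specifies.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PoolInfo:
    name: str
    kind: str        # assumed classification, e.g. "performance" or "capacity"
    free_gb: int     # assumed capacity attribute reported by discovery


def select_pool(pools: List[PoolInfo], kind: str, needed_gb: int,
                default: PoolInfo) -> PoolInfo:
    """Return the first pool meeting the criteria, else the default pool."""
    for pool in pools:
        if pool.kind == kind and pool.free_gb >= needed_gb:
            return pool
    return default
```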

Mapping and Access

The Object Manager also provides sets of Object Management operations to allow and maintain the availability of these objects to external Applications. The first set is operations for registering and unregistering the computers where the user's Applications reside. The computers are registered by the identities typical to the storage network in use (e.g. Fibre Channel WWPN, iSCSI identity, etc.). The second set is “mapping” operations, and when permitted by the storage pool from which an object is created, the Data Storage Object can be “mapped,” that is, made available for use to a computer on which a user Application resides.

This availability takes a form appropriate to the storage, e.g. a block device presented on a SAN as a Fibre Channel disk or iSCSI device on a network, a filesystem on a file sharing network, etc. and is usable by the operating system on the Application computer. Similarly, an “unmapping” operation reverses the availability of the virtual storage device on the network to a user Application. In this way, data stored for one Application, i.e. a backup, can be made available to another Application on another computer at a later time, i.e. a restore.

502 Data Mover

The Data Mover 502 is a software component within the Object Manager and Data Mover that reads and writes data among the various Data Storage Objects 503 according to instructions received from the Object Manager for Snapshot (Point in Time) Copy requests and standard copy requests. The Data Mover provides operations for reading and writing data among instances of data objects throughout the system. The Data Mover also provides operations that allow querying and maintaining the state of long running operations that the Object Manager has requested for it to perform.

The Data Mover uses functionality from the Pool Functionality Providers (see FIG. 6) to accomplish its operation. The Snapshot functionality provider 608 allows creation of a data object instance representing an initial object instance at a specific point in time. The Difference Engine functionality provider 614 is used to request a description of the differences between two data objects that are related in a temporal chain. For data objects stored on content-addressable pools, a special functionality is provided that can provide differences between any two arbitrary data objects. This functionality is also provided for performance-optimized pools, in some cases by an underlying storage virtualization system, and in other cases by a module that implements this on top of commodity storage. The Data Mover 502 uses the information about the differences to select the set of data that it copies between instances of data objects 503.

For a given Pool, the Difference Engine Provider provides a specific representation of the differences between two states of a Data Storage Object over time. For a Snapshot provider the changes between two points in time are recorded as writes to a given part of the Data Storage Object. In one embodiment, the difference is represented as a bitmap where the bits correspond to an ordered list of the Data Object areas, starting at the first and ascending in order to the last, and where a set bit indicates a modified area. This bitmap is derived from the copy-on-write bitmaps used by the underlying storage virtualization system. In another embodiment, the difference may be represented as a list of extents corresponding to changed areas of data. For a Content Addressable storage provider 610, the representation is described below, and is used to determine efficiently the parts of two Content Addressable Data Objects that differ.
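The two representations mentioned above carry the same information, as the following minimal sketch suggests. The function names and the choice of (offset, length) pairs measured in areas are assumptions made for illustration; an actual provider may use different units or layouts.

```python
def bitmap_to_extents(bitmap):
    """Convert a change bitmap into a list of (offset, length) extents."""
    extents = []
    run_start = None
    for i, bit in enumerate(bitmap):
        if bit and run_start is None:
            run_start = i                      # a run of changed areas begins
        elif not bit and run_start is not None:
            extents.append((run_start, i - run_start))
            run_start = None
    if run_start is not None:                  # run extends to the last area
        extents.append((run_start, len(bitmap) - run_start))
    return extents


def extents_to_bitmap(extents, length):
    """Convert (offset, length) extents back into a change bitmap."""
    bitmap = [False] * length
    for offset, count in extents:
        for i in range(offset, offset + count):
            bitmap[i] = True
    return bitmap
```

An extent list tends to be more compact when changes are clustered, while a bitmap has a fixed size regardless of how the changes are distributed.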

The Data Mover uses this information to copy only those sections that differ, so that a new version of a Data Object can be created from an existing version by first duplicating it, obtaining the list of differences, and then moving only the data corresponding to those differences in the list. The Data Mover 502 traverses the list of differences, moving the indicated areas from the source Data Object to the target Data Object. (See Optimal Way for Data Backup and Restore.)
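A minimal sketch of this duplicate-then-patch procedure is given below. The function make_new_version is hypothetical, data objects are modelled as Python lists of areas, and the differences are assumed to arrive as a bitmap.

```python
def make_new_version(existing_target, source, diff_bitmap):
    """Create a new target version by duplicating the existing version
    and moving only the areas that the difference bitmap marks."""
    new_target = list(existing_target)      # first, duplicate the old version
    moved = 0
    for index, changed in enumerate(diff_bitmap):
        if changed:
            new_target[index] = source[index]   # move only differing areas
            moved += 1
    return new_target, moved
```

The existing version is left unmodified, so both the old and the new version remain available after the copy.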

506 Copy Operation—Request Translation and Instructions

The Object Manager 501 instructs the Data Mover 502 through a series of operations to copy data among the data objects in the Virtual Storage Pools 418. The procedure includes the following steps, starting at the reception of instructions:

First, create Collection request. A name for the collection is returned.

Second, add Object to Collection. The collection name from above is used as well as the name of the source Data Object that is to be copied and the name of two antecedents: a Data Object against which differences are to be taken in the source Storage Resource Pool, and a corresponding Data Object in the target Storage Resource Pool. This step is repeated for each source Data Object to be operated on in this set.

Third, prepare Copy Request. The collection name is supplied as well as a Storage Resource Pool to act as a target. The prepare command instructs the Object Manager to contact the Storage Pool Manager to create the necessary target Data Objects, corresponding to each of the sources in the collection. The prepare command also supplies the corresponding Data Object in the target Storage Resource Pool to be duplicated, so the Provider can duplicate the provided object and use that as a target object. A reference name for the copy request is returned.

Fourth, activate Copy Request. The reference name for the copy request returned above is supplied. The Data Mover is instructed to copy a given source object to its corresponding target object. Each request includes a reference name as well as a sequence number to describe the overall job (the entire set of source-target pairs) as well as a sequence number to describe each individual source-target pair. In addition to the source-target pair, the names of the corresponding antecedents are supplied as part of the Copy instruction.

Fifth, the Copy Engine uses the name of the Data Object in the source pool to obtain the differences between the antecedent and the source from the Difference Engine at the source. The indicated differences are then transmitted from the source to the target. In one embodiment, these differences are transmitted as bitmaps and data. In another embodiment, these differences are transmitted as extent lists and data.
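The first four steps above can be illustrated with the following sketch of the request bookkeeping. Everything here is hypothetical: the class CopyProtocolSketch, the generated name formats, and the dictionary fields are assumptions chosen for the example and are not the system's actual protocol encoding.

```python
import itertools


class CopyProtocolSketch:
    """Hypothetical bookkeeping for collection-based copy requests."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.collections = {}
        self.requests = {}

    def create_collection(self):
        # Step 1: a name for the collection is returned.
        name = "collection-%d" % next(self._counter)
        self.collections[name] = []
        return name

    def add_object(self, collection, source, source_antecedent,
                   target_antecedent):
        # Step 2: record the source object and its two antecedents.
        self.collections[collection].append(
            (source, source_antecedent, target_antecedent))

    def prepare_copy(self, collection, target_pool):
        # Step 3: one target per source; a reference name is returned.
        reference = "copy-%d" % next(self._counter)
        targets = ["%s@%s" % (src, target_pool)
                   for src, _, _ in self.collections[collection]]
        self.requests[reference] = (collection, targets)
        return reference

    def activate_copy(self, reference, job_sequence):
        # Step 4: emit one instruction per source-target pair.
        collection, targets = self.requests[reference]
        members = self.collections[collection]
        return [
            {"reference": reference, "job_sequence": job_sequence,
             "pair_sequence": n, "source": src, "target": tgt,
             "source_antecedent": s_ante, "target_antecedent": t_ante}
            for n, ((src, s_ante, t_ante), tgt)
            in enumerate(zip(members, targets), start=1)
        ]
```

Each emitted instruction carries enough information for the fifth step: the source and its antecedent identify which differences to request, and the target identifies where they are applied.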

503 Data Storage Objects

Data Storage Objects are software constructs that permit the storage and retrieval of Application data using idioms and methods familiar to computer data processing equipment and software. In practice these currently take the form of a SCSI block device on a storage network, e.g. a SCSI LUN, or a content-addressable container, where a designator for the content is constructed from and uniquely identifies the data therein. Data Storage Objects are created and maintained by issuing instructions to the Pool Manager. The actual storage for persisting the Application data is drawn from the Virtual Storage Pool from which the Data Storage Object is created.

The structure of the data storage object varies depending on the storage pool from which it is created. For the objects that take the form of a block device on a storage network, the data structure for a given block device Data Object implements a mapping between the Logical Block Address (LBA) of each of the blocks within the Data Object to the device identifier and LBA of the actual storage location. The identifier of the Data Object is used to identify the set of mappings to be used. The current embodiment relies on the services provided by the underlying physical computer platform to implement this mapping, and relies on its internal data structures, such as bitmaps or extent lists.
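For illustration, the mapping described above might be modelled as a per-object table from logical block address to a (device identifier, physical LBA) pair. The class BlockDeviceObject is hypothetical; as noted, the current embodiment relies on the underlying platform for this mapping rather than implementing it directly.

```python
class BlockDeviceObject:
    """Hypothetical block-device Data Object with an LBA mapping table."""

    def __init__(self, object_id, block_count):
        self.object_id = object_id    # identifies the set of mappings
        self.block_count = block_count
        self.mapping = {}             # logical LBA -> (device_id, physical LBA)

    def map_block(self, lba, device_id, physical_lba):
        if not 0 <= lba < self.block_count:
            raise IndexError("LBA outside the Data Object")
        self.mapping[lba] = (device_id, physical_lba)

    def resolve(self, lba):
        """Return where a logical block is actually stored, or None."""
        return self.mapping.get(lba)
```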

For objects that take the form of a Content Addressable Container, the content signature is used as the identifier, and the Data Object is stored as is described below in the section about deduplication.

504 Pool Manager

A Pool Manager 504 is a software component for managing virtual storage resources and the associated functionality and characteristics as described below. The Object Manager 501 and Data Movement Engine 502 communicate with one or more Pool Managers 504 to maintain Data Storage Objects 503.

510 Virtual Storage Resources

Virtual Storage Resources 510 are various kinds of storage made available to the Pool Manager for implementing storage pool functions, as described below. In this embodiment, a storage virtualizer is used to present various external Fibre Channel or iSCSI storage LUNs as virtualized storage to the Pool Manager 504.

The Storage Pool Manager

FIG. 6 further illustrates the Storage Pool Manager 504. The purpose of the storage pool manager is to present underlying virtual storage resources to the Object Manager/Data Mover as Storage Resource Pools, which are abstractions of storage and data management functionality with common interfaces that are utilized by other components of the system. These common interfaces typically include a mechanism for identifying and addressing data objects associated with a specific temporal state, and a mechanism for producing differences between data objects in the form of bitmaps or extents. In this embodiment, the pool manager presents a Primary Storage Pool, a Performance Optimized Pool, and a Capacity Optimized Pool. The common interfaces allow the object manager to create and delete Data Storage objects in these pools, either as copies of other data storage objects or as new objects, and the data mover can move data between data storage objects, and can use the results of data object differencing operations.

The storage pool manager has a typical architecture for implementing acommon interface to diverse implementations of similar functionality,where some functionality is provided by “smart” underlying resources,and other functionality must be implemented on top of less functionalunderlying resources.

Pool request broker 602 and pool functionality providers 604 aresoftware modules executing in either the same process as the ObjectManager/Data Mover, or in another process communicating via a local ornetwork protocol such as TCP. In this embodiment the providers include aPrimary Storage provider 606, Snapshot provider 608, Content Addressableprovider 610, and Difference Engine provider 614, and these are furtherdescribed below. In another embodiment the set of providers may be asuperset of those shown here.

Virtual Storage Resources 510 are the different kinds of storage madeavailable to the Pool Manager for implementing storage pool functions.In this embodiment, the virtual storage resources include sets of SCSIlogical units from a storage virtualization system that runs on the samehardware as the pool manager, and accessible (for both data andmanagement operations) through a programmatic interface: in addition tostandard block storage functionality additional capabilities areavailable including creating and deleting snapshots, and trackingchanged portions of volumes. In another embodiment the virtual resourcescan be from an external storage system that exposes similarcapabilities, or may differ in interface (for example accessed through afile-system, or through a network interface such as CIFS, iSCSI orCDMI), in capability (for example, whether the resource supports anoperation to make a copy-on-write snapshot), or in non-functionalaspects (for example, high-speed/limited-capacity such as Solid StateDisk versus low-speed/high-capacity such as SATA disk). The capabilitiesand interface available determine which providers can consume thevirtual storage resources, and which pool functionality needs to beimplemented within the pool manager by one or more providers: forexample, this implementation of a content addressable storage provideronly requires “dumb” storage, and the implementation is entirely withincontent addressable provider 610; an underlying content addressablevirtual storage resource could be used instead with a simpler“pass-through” provider. Conversely, this implementation of a snapshotprovider is mostly “pass-through” and requires storage that exposes aquick point-in-time copy operation.

Pool Request Broker 602 is a simple software component that services requests for storage pool specific functions by executing an appropriate set of pool functionality providers against the configured virtual storage resource 510. The requests that can be serviced include, but are not limited to, creating an object in a pool; deleting an object from a pool; writing data to an object; reading data from an object; copying an object within a pool; copying an object between pools; requesting a summary of the differences between two objects in a pool.

Primary storage provider 606 enables management interfaces (for example, creating and deleting snapshots, and tracking changed portions of files) to a virtual storage resource that is also exposed directly to applications via an interface such as fibre channel, iSCSI, NFS or CIFS.

Snapshot provider 608 implements the function of making a point-in-time copy of data from a Primary resource pool. This creates the abstraction of another resource pool populated with snapshots. As implemented, the point-in-time copy is a copy-on-write snapshot of the object from the primary resource pool, consuming a second virtual storage resource to accommodate the copy-on-write copies, since this management functionality is exposed by the virtual storage resources used for primary storage and for the snapshot provider.

Difference engine provider 614 can satisfy a request to compare two objects in a pool that are connected in a temporal chain. The difference sections between the two objects are identified and summarized in a provider-specific way, e.g. using bitmaps or extents. For example, the difference sections might be represented as a bitmap where each set bit denotes a fixed size region where the two objects differ; or the differences might be represented procedurally as a series of function calls or callbacks.
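By way of illustration only, the two difference representations mentioned above might be sketched as follows. The function names `diff_bitmap` and `bitmap_to_extents`, the 4096-byte default region size, and the in-memory `bytes` objects are assumptions made for this sketch and do not appear in the described embodiment. The sketch also compares the two objects directly, whereas a difference engine acting on a snapshot pool would typically retrieve an existing record of change.

```python
def diff_bitmap(old: bytes, new: bytes, region_size: int = 4096) -> list[bool]:
    """Return one flag per fixed-size region; True means the region differs."""
    length = max(len(old), len(new))
    regions = (length + region_size - 1) // region_size
    return [
        old[i * region_size:(i + 1) * region_size]
        != new[i * region_size:(i + 1) * region_size]
        for i in range(regions)
    ]


def bitmap_to_extents(bitmap: list[bool], region_size: int = 4096) -> list[tuple[int, int]]:
    """Collapse runs of set bits into (byte_offset, byte_length) extents."""
    extents = []
    start = None  # index of the first region in the current run of changes
    for index, changed in enumerate(bitmap):
        if changed and start is None:
            start = index
        elif not changed and start is not None:
            extents.append((start * region_size, (index - start) * region_size))
            start = None
    if start is not None:
        # The final run extends to the end of the bitmap.
        extents.append((start * region_size, (len(bitmap) - start) * region_size))
    return extents
```

A bitmap has a fixed size regardless of how many regions changed, while an extent list is compact when changes are clustered; either form identifies the same changed regions.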

Depending on the virtual storage resource on which the pool is based, or on other providers implementing the pool, a difference engine may produce a result efficiently in various ways. As implemented, a difference engine acting on a pool implemented via a snapshot provider uses the copy-on-write nature of the snapshot provider to track changes to objects that have had snapshots made. Consecutive snapshots of a single changing primary object thus have a record of the differences that is stored alongside them by the snapshot provider, and the difference engine for snapshot pools simply retrieves this record of change. Also as implemented, a difference engine acting on a pool implemented via a Content Addressable provider uses the efficient tree structure (see below, FIG. 12) of the content addressable implementation to do rapid comparisons between objects on demand.

Content addressable provider 610 implements a write-once content addressable interface to the virtual storage resource it consumes. It satisfies read, write, duplicate and delete operations. Each written or copied object is identified by a unique handle that is derived from its content. The content addressable provider is described further below (FIG. 11).

Pool Manager Operations

In operation, the pool request broker 602 accepts requests for data manipulation operations such as copy, snapshot, or delete on a pool or object. The request broker determines which provider code from pool 504 to execute by looking at the name or reference to the pool or object. The broker then translates the incoming service request into a form that can be handled by the specific pool functionality provider, and invokes the appropriate sequence of provider operations.

For example, an incoming request could ask to make a snapshot from a volume in a primary storage pool, into a snapshot pool. The incoming request identifies the object (volume) in the primary storage pool by name, and the combination of name and operation (snapshot) determines that the snapshot provider should be invoked, which can make point-in-time snapshots from the primary pool using the underlying snapshot capability. This snapshot provider will translate the request into the exact form required by the native copy-on-write function performed by the underlying storage virtualization appliance, such as bitmaps or extents, and it will translate the result of the native copy-on-write function to a storage volume handle that can be returned to the object manager and used in future requests to the pool manager.

Optimal Way for Data Backup Using the Object Manager and Data Mover

Optimal Way for Data Backup is a series of operations to make successive versions of Application Data objects over time, while minimizing the amount of data that must be copied by using bitmaps, extents and other temporal difference information stored at the Object Mover. It stores the application data in a data storage object and associates with it the metadata that relates the various changes to the application data over time, such that changes over time can be readily identified.

In a preferred embodiment, the procedure includes the following steps:

1. The mechanism provides an initial reference state, e.g. T0, of the Application Data within a Data Storage Object.
2. Subsequent instances (versions) are created on demand over time of the Data Storage Object in a Virtual Storage Pool that has a Difference Engine Provider.
3. Each successive version, e.g. T4, T5, uses the Difference Engine Provider for the Virtual Storage Pool to obtain the difference between it and the instance created prior to it, so that T5 is stored as a reference to T4 and a set of differences between T5 and T4.
4. The Copy Engine receives a request to copy data from one data object (the source) to another data object (the destination).
5. If the Virtual Storage Pool in which the destination object will be created contains no other objects created from prior versions of the source data object, then a new object is created in the destination Virtual Storage Pool and the entire contents of the source data object are copied to the destination object; the procedure is complete. Otherwise the next steps are followed.
6. If the Virtual Storage Pool in which the destination object is created contains objects created from prior versions of the source data object, a recently created prior version in the destination Virtual Storage Pool is selected for which there exists a corresponding prior version in the Virtual Storage Pool of the source data object. For example, if a copy of T5 is initiated from a snapshot pool, and an object created at time T3 is the most recent version available at the target, T3 is selected as the prior version.
7. Construct a time-ordered list of the versions of the source data object, beginning with an initial version identified in the previous step, and ending with the source data object that is about to be copied. In the above example, at the snapshot pool, all states of the object are available, but only the states including and following T3 are of interest: T3, T4, T5.
8. Construct a corresponding list of the differences between each successive version in the list such that all of the differences, from the beginning version of the list to the end, are represented. Each difference both identifies which portion of data has changed and includes the new data for the corresponding time. This creates a set of differences from the target version to the source version, e.g. the difference between T3 and T5.
9. Create the destination object by duplicating the prior version of the object identified in Step 6 in the destination Virtual Storage Pool, e.g. object T3 in the target store.
10. Copy the set of differences identified in the list created in Step 8 from the source data object to the destination object; the procedure is complete.

Each data object within the destination Virtual Storage Pool is complete; that is, it represents the entire data object and allows access to all of the Application Data at the point in time without requiring external reference to state or representations at other points in time. The object is accessible without replaying all deltas from a baseline state to the present state. Furthermore, the duplication of initial and subsequent versions of the data object in the destination Virtual Storage Pool does not require exhaustive duplication of the Application Data contents therein. Finally, to arrive at second and subsequent states requires only the transmission of the changes tracked and maintained, as described above, without exhaustive traversal, transmission or replication of the contents of the data storage object.
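Steps 4 through 10 of the backup procedure may be sketched, under simplifying assumptions, as follows. In this sketch each pool is modeled as a dictionary from a version name to a list of equally sized blocks, `order` is a time-ordered list of the versions held in the source pool, and `changed_blocks` stands in for a Difference Engine Provider; all of these names and the in-memory model are hypothetical and serve only to illustrate the flow.

```python
def changed_blocks(earlier, later):
    """Stand-in for a Difference Engine Provider: indices of differing blocks."""
    return {i for i, (a, b) in enumerate(zip(earlier, later)) if a != b}


def optimal_backup(source_pool, dest_pool, order, target):
    """Copy version `target` into dest_pool; return the number of blocks transferred.

    source_pool, dest_pool: dict mapping version name -> list of blocks
    order: time-ordered list of the version names held in source_pool
    """
    source = source_pool[target]
    position = order.index(target)
    # Step 6: prior versions that exist in both the destination and source pools.
    priors = [v for v in order[:position] if v in dest_pool and v in source_pool]
    if not priors:
        # Step 5: no usable prior version, so copy the entire object.
        dest_pool[target] = list(source)
        return len(source)
    base = priors[-1]  # the most recent common prior version
    # Step 7: time-ordered versions from the selected prior version to the target.
    chain = order[order.index(base):position + 1]
    # Step 8: accumulate the differences between each successive pair.
    changed = set()
    for earlier, later in zip(chain, chain[1:]):
        changed |= changed_blocks(source_pool[earlier], source_pool[later])
    # Step 9: duplicate the prior version within the destination pool.
    dest_pool[target] = list(dest_pool[base])
    # Step 10: transfer only the changed blocks from the source object.
    for index in sorted(changed):
        dest_pool[target][index] = source[index]
    return len(changed)
```

With versions T3, T4 and T5 in the source pool and only T3 in the destination pool, a copy of T5 transfers just the blocks that changed after T3, yet the resulting destination object is complete in itself.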

Optimal Way for Data Restore Using the Object Manager and Data Mover

Intuitively, the operation of the Optimal Way for Data Restore is the converse of the Optimal Way for Data Backup. The procedure to recreate the desired state of a data object in a destination Virtual Storage Pool at a given point in time includes the following steps:

1. Identify a version of the data object in another Virtual Storage Pool that has a Difference Engine Provider, corresponding to the desired state to be recreated. This is the source data object in the source Virtual Storage Pool.
2. Identify a preceding version of the data object to be recreated in the destination Virtual Storage Pool.
3. If no version of the data object is identified in Step 2, then create a new destination object in the destination Virtual Storage Pool and copy the data from the source data object to the destination data object. The procedure is complete. Otherwise, proceed with the following steps.
4. If a version of the data object is identified in Step 2, then identify a data object in the source Virtual Storage Pool corresponding to the data object identified in Step 2.
5. If no data object is identified in Step 4, then create a new destination object in the destination Virtual Storage Pool and copy the data from the source data object to the destination data object. The procedure is complete. Otherwise, proceed with the following steps.
6. Create a new destination data object in the Destination Virtual Storage Pool by duplicating the data object identified in Step 2.
7. Employ the Difference Engine Provider for the source Virtual Storage Pool to obtain the set of differences between the data object identified in Step 1 and the data object identified in Step 4.
8. Copy the data identified by the list created in Step 7 from the source data object to the destination data object. The procedure is complete.

Access to the desired state is complete: it does not require external reference to other containers or other states. Establishing the desired state given a reference state requires neither exhaustive traversal nor exhaustive transmission, only the retrieved changes indicated by the provided representations within the source Virtual Storage Pool.

The Service Level Agreement

FIG. 7 illustrates the Service Level Agreement. The Service Level Agreement captures the detailed business requirements with respect to secondary copies of the application data. In the simplest description, the business requirements define when and how often copies are created, how long they are retained and in what type of storage pools these copies reside. This simplistic description does not capture several aspects of the business requirements. The frequency of copy creation for a given type of pool may not be uniform across all hours of the day or across all days of a week. Certain hours of the day, or certain days of a week or month may represent more (or less) critical periods in the application data, and thus may call for more (or less) frequent copies. Similarly, all copies of application data in a particular pool may not be required to be retained for the same length of time. For example, a copy of the application data created at the end of monthly processing may need to be retained for a longer period of time than a copy in the same storage pool created in the middle of a month.

The Service Level Agreement 304 of certain embodiments has been designed to represent all of these complexities that exist in the business requirements. The Service Level Agreement has four primary parts: the name, the description, the housekeeping attributes and a collection of Service Level Policies. As mentioned above, there is one SLA per application.

The name attribute 701 allows each Service Level Agreement to have a unique name.

The description attribute 702 is where the user can assign a helpful description for the Service Level Agreement.

The Service Level Agreement also has a number of housekeeping attributes 703 that enable it to be maintained and revised. These attributes include but are not limited to the owner's identity, the dates and times of creation, modification and access, priority, enable/disable flags.

The Service Level Agreement also contains a plurality of Service Level Policies 705. Some Service Level Agreements may have just a single Service Level Policy. More typically, a single SLA may contain tens of policies.

Each Service Level Policy includes at least the following, in certain embodiments: the source storage pool location 706 and type 708; the target storage pool location 710 and type 712; the frequency for the creation of copies 714, expressed as a period of time; the length of retention of the copy 716, expressed as a period of time; the hours of operation 718 during the day for this particular Service Level Policy; and the days of the week, month or year 720 on which this Service Level Policy applies.
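One possible in-memory representation of such a policy is sketched below. The class name, field names, defaults, and the `applies_at` helper are assumptions introduced for illustration; the sketch also collapses pool location and type into a single pool name and models only days of the week, not days of the month or year.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass(frozen=True)
class ServiceLevelPolicy:
    source_pool: str       # source storage pool (location and type combined)
    target_pool: str       # target storage pool (location and type combined)
    frequency: timedelta   # how often a copy is created
    retention: timedelta   # how long each copy is kept
    start_hour: time = time(0, 0)          # start of the hours of operation
    end_hour: time = time(23, 59, 59)      # end of the hours of operation
    weekdays: frozenset = frozenset(range(7))  # 0 = Monday ... 6 = Sunday

    def applies_at(self, moment: datetime) -> bool:
        """True when `moment` falls on an applicable day and within the hours of operation."""
        return (moment.weekday() in self.weekdays
                and self.start_hour <= moment.time() <= self.end_hour)
```

For instance, a policy limited to Monday through Friday between 9:00 and 17:00 would report that it applies at 10:00 on a Tuesday and that it does not apply at 10:00 on a Saturday.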

Each Service Level Policy specifies a source and target storage pool, and the frequency of copies of application data that are desired between those storage pools. Furthermore, the Service Level Policy specifies its hours of operation and days on which it is applicable. Each Service Level Policy is the representation of one single statement in the business requirements for the protection of application data. For example, if a particular application has a business requirement for an archive copy to be created each month after the monthly close and retained for three years, this might translate to a Service Level Policy that requires a copy from the Local Backup Storage Pool into the Long-term Archive Storage Pool at midnight on the last day of the month, with a retention of three years.

All of the Service Level Policies with a particular combination of source and destination pool and location, say for example, source Primary Storage pool and destination local Snapshot pool, when taken together, specify the business requirements for creating copies into that particular destination pool. Business requirements may dictate for example that snapshot copies be created every hour during regular working hours, but only once every four hours outside of these times. Two Service Level Policies with the same source and target storage pools will effectively capture these requirements in a form that can be put into practice by the Service Policy Engine.

This form of a Service Level Agreement allows the representation of the schedule of daily, weekly and monthly business activities, and thus captures business requirements for protecting and managing application data much more accurately than traditional RPO and RTO based schemes. By allowing hours of operation and days, weeks, and months of the year, scheduling can occur on a “calendar basis.”

Taken together, all of the Service Level Policies with one particular combination of source and destinations, for example, “source: local primary and destination: local performance optimized”, capture the non-uniform data protection requirements for one type of storage. A single RPO number, on the other hand, forces a single uniform frequency of data protection across all times of day and all days. For example, a combination of Service Level Policies may require a large number of snapshots to be preserved for a short time, such as 10 minutes, and a lesser number of snapshots to be preserved for a longer time, such as 8 hours; this allows a small amount of information that has been accidentally deleted to be reverted to a state not more than 10 minutes before, while still providing substantial data protection at longer time horizons without requiring the storage overhead of storing all snapshots taken every ten minutes. As another example, the backup data protection function may be given one Policy that operates with one frequency during the work week, and another frequency during the weekend.

When Service Level Policies for all of the different classes of source and destination storage are included, the Service Level Agreement fully captures all of the data protection requirements for the entire application, including local snapshots, local long duration stores, off-site storage, archives, etc. A collection of policies within a SLA is capable of expressing when a given function should be performed, and is capable of expressing multiple data management functions that should be performed on a given source of data.

Service Level Agreements are created and modified by the user through a user interface on a management workstation. These agreements are electronic documents stored by the Service Policy Engine in a structured SQL database or other repository that it manages. The policies are retrieved, electronically analyzed, and acted upon by the Service Policy Engine through its normal scheduling algorithm as described below.

FIG. 8 illustrates the Application Specific Module 402. The Application Specific Module runs close to the Application 300 (as described above), and interacts with the Application and its operating environment to gather metadata and to query and control the Application as required for data management operations.

The Application Specific Module interacts with various components of the application and its operating environment including Application Service Processes and Daemons 801, Application Configuration Data 802, Operating System Storage Services 803 (such as VSS and VDS on Windows), Logical Volume Management and Filesystem Services 804, and Operating System Drivers and Modules 805.

The Application Specific Module performs these operations in response to control commands from the Service Policy Engine 406. There are two purposes for these interactions with the application: Metadata Collection and Application Consistency.

Metadata Collection is the process by which the Application Specific Module collects metadata about the application. In some embodiments, metadata includes information such as: configuration parameters for the application; state and status of the application; control files and startup/shutdown scripts for the application; location of the data files, journal and transaction logs for the application; and symbolic links, filesystem mount points, logical volume names, and other such entities that can affect the access to application data.

Metadata is collected and saved along with application data and SLA information. This guarantees that each copy of application data within the system is self-contained and includes all of the details required to rebuild the application data.

Application Consistency is the set of actions that ensure that when a copy of the application data is created, the copy is valid, and can be restored into a valid instance of the application. This is critical when the business requirements dictate that the application be protected while it is live, in its online, operational state. The application may have interdependent data relations within its data stores, and if these are not copied in a consistent state, the copy will not provide a valid restorable image.

The exact process of achieving application consistency varies from application to application. Some applications have a simple flush command that forces cached data to disk. Some applications support a hot backup mode where the application ensures that its operations are journaled in a manner that guarantees consistency even as application data is changing. Some applications require interactions with operating system storage services such as VSS and VDS to ensure consistency. The Application Specific Module is purpose-built to work with a particular application and to ensure the consistency of that application. The Application Specific Module interacts with the underlying storage virtualization device and the Object Manager to provide consistent snapshots of application data.

For efficiency, the preferred embodiment of the Application Specific Module 402 is to run on the same server as Application 300. This assures the minimum latency in the interactions with the application, and provides access to storage services and filesystems on the application host. The application host is typically considered primary storage, which is then snapshotted to a performance-optimized store.

In order to minimize interruption of a running application, including minimizing preparatory steps, the Application Specific Module is only triggered to make a snapshot when access to application data is required at a specific time, and when a snapshot for that time does not exist elsewhere in the system, as tracked by the Object Manager. By tracking which times snapshots have been made, the Object Manager is able to fulfill subsequent data requests from the performance-optimized data store, including for satisfying multiple requests for backup and replication which may issue from secondary, capacity-optimized pools. The Object Manager may be able to provide object handles to the snapshot in the performance-optimized store, and may direct the performance-optimized store in a native format that is specific to the format of the snapshot, which is dependent on the underlying storage appliance. In some embodiments this format may be application data combined with one or more LUN bitmaps indicating which blocks have changed; in other embodiments it may be specific extents. The format used for data transfer is thus able to transfer only a delta or difference between two snapshots using bitmaps or extents.

Metadata, such as the version number of the application, may also be stored for each application along with the snapshot. When a SLA policy is executed, application metadata is read and used for the policy. This metadata is stored along with the data objects. For each SLA, application metadata will only be read once during the lightweight snapshot operation, and preparatory operations which occur at that time such as flushing caches will only be performed once during the lightweight snapshot operation, even though this copy of application data along with its metadata may be used for multiple data management functions.

The Service Policy Engine

FIG. 9 illustrates the Service Policy Engine 406. The Service Policy Engine contains the Service Policy Scheduler 902, which examines all of the Service Level Agreements configured by the user and makes scheduling decisions to satisfy Service Level Agreements. It relies on several data stores to capture information and persist it over time, including, in some embodiments, a SLA Store 904, where configured Service Level Agreements are persisted and updated; a Resource Profile Store 906, storing Resource Profiles that provide a mapping between logical storage pool names and actual storage pools; Protection Catalog Store 908, where information is cataloged about previous successful copies created in various pools that have not yet expired; and centralized History Store 910.

History Store 910 is where historical information about past activities is saved for the use of all data management applications, including the timestamp, order and hierarchy of previous copies of each application into various storage pools. For example, a snapshot copy from a primary data store to a capacity-optimized data store that is initiated at 1 P.M. and is scheduled to expire at 9 P.M. will be recorded in History Store 910 in a temporal data store that also includes linked object data for snapshots for the same source and target that have taken place at 11 A.M. and 12 P.M.

These stores are managed by the Service Policy Engine. For example, when the user, through the Management workstation, creates a Service Level Agreement, or modifies one of the policies within it, it is the Service Policy Engine that persists this new SLA in its store, and reacts to this modification by scheduling copies as dictated by the SLA. Similarly, when the Service Policy Engine successfully completes a data movement job that results in a new copy of an application in a Storage Pool, the Service Policy Engine updates the History Store, so that this copy will be factored into future decisions.

The preferred embodiment of the various stores used by the Service Policy Engine is in the form of tables in a relational database management system in close proximity to the Service Policy Engine. This ensures consistent transactional semantics when querying and updating the stores, and allows for flexibility in retrieving interdependent data.

The scheduling algorithm for the Service Policy Scheduler 902 is illustrated in FIG. 10. When the Service Policy Scheduler decides it needs to make a copy of application data from one storage pool to another, it initiates a Data Movement Requestor and Monitor task 912. These tasks are not recurring tasks and terminate when they are completed. Depending on the way that Service Level Policies are specified, a plurality of these requestors might be operational at the same time.

The Service Policy Scheduler considers the priorities of Service Level Agreements when determining which additional tasks to undertake. For example, if one Service Level Agreement has a high priority because it specifies the protection for a mission-critical application, whereas another SLA has a lower priority because it specifies the protection for a test database, then the Service Policy Engine may choose to run only the protection for the mission-critical application, and may postpone or even entirely skip the protection for the lower priority application. This is accomplished by the Service Policy Engine scheduling a higher priority SLA ahead of a lower priority SLA. In the preferred embodiment, in such a situation, for auditing purposes, the Service Policy Engine will also trigger a notification event to the management workstation.

The Policy Scheduling Algorithm

FIG. 10 illustrates the flowchart of the Policy Schedule Engine. The Policy Schedule Engine continuously cycles through all the SLAs defined. When it gets to the end of all of the SLAs, it sleeps for a short while, e.g. 10 seconds, and resumes looking through the SLAs again. Each SLA encapsulates the complete data protection business requirements for one application; thus all of the SLAs represent all of the applications.

The process starts at 1000 and iterates to the next SLA in the set of SLAs at 1002. For each SLA, the schedule engine collects together all of the Service Level Policies that have the same source pool and destination pool 1004. Taken together, this subset of the Service Level Policies represents all of the requirements for a copy from that source storage pool to that particular destination storage pool.

Among this subset of Service Level Policies, the Service Policy Scheduler discards the policies that are not applicable to today, or are outside their hours of operation. Among the policies that are left, it finds the policy that has the shortest frequency 1006 and, based on the history data in History Store 910, the one with the longest retention that needs to be run next 1008.

Next, there are a series of checks 1010-1014 which rule out making a new copy of application data at this time: because the new copy is not yet due, because a copy is already in progress or because there is not new data to copy. If any of these conditions apply, the Service Policy Scheduler moves to the next combination of source and destination pools 1004. If none of these conditions apply, a new copy is initiated. The copy is executed as specified in the corresponding service level policy within this SLA 1016.

Next, the Scheduler moves to the next Source and Destination pool combination for the same Service Level Agreement 1018. If there are no more distinct combinations, the Scheduler moves on to the next Service Level Agreement 1020.

After the Service Policy Scheduler has been through all source/destination pool combinations of all Service Level Agreements, it pauses for a short period and then resumes the cycle.
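A single pass of this cycle over the policies of one SLA might be sketched as follows. The `Policy` tuple, the `plan_copies` function, and the `last_copy` mapping (a stand-in for lookups against the History Store) are hypothetical names. The sketch treats a policy as due when one of its interval boundaries has passed since the most recent copy, which is one assumed reading of steps 1006 through 1010, and it omits the hours-of-operation filter and the in-progress and new-data checks.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import NamedTuple


class Policy(NamedTuple):
    source: str
    target: str
    frequency: timedelta
    retention: timedelta
    start: datetime  # when the policy went into effect


def plan_copies(policies, now, last_copy):
    """Return {(source, target): retention} for the copies to initiate at `now`.

    last_copy maps (source, target) -> datetime of the most recent copy.
    """
    groups = defaultdict(list)
    for policy in policies:
        # Step 1004: group policies by source and destination pool.
        groups[(policy.source, policy.target)].append(policy)

    plan = {}
    for key, group in groups.items():
        previous = last_copy.get(key)

        def is_due(policy):
            if policy.start > now:
                return False  # not yet in effect
            if previous is None:
                return True
            # Due when an interval boundary lies after the previous copy.
            return ((now - policy.start) // policy.frequency
                    > (previous - policy.start) // policy.frequency)

        due = [policy for policy in group if is_due(policy)]
        if due:
            # Coalesce: one copy, tagged with the longest retention that is due.
            plan[key] = max(policy.retention for policy in due)
    return plan
```

Because every due policy for a pool pair is folded into one entry, a single copy operation satisfies all of them, and the policy with the shortest frequency effectively sets the cadence.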

A simple example system with a snapshot store and a backup store, withonly 2 policies defined, would interact with the Service PolicyScheduler as follows. Given two policies, one stating “backup everyhour, the backup to be kept for 4 hours” and another stating “backupevery 2 hours, the backup to be kept for 8 hours,” the result would be asingle snapshot taken each hour, the snapshots each being copied to thebackup store but retained a different amount of time at both thesnapshot store and the backup store. The “backup every 2 hours” policyis scheduled to go into effect at 12:00 P.M by the system administrator.

At 4:00 P.M., when the Service Policy Scheduler begins operating at step1000, it finds the two policies at step 1002. (Both policies applybecause a multiple of two hours has elapsed since 12:00 P.M.) There isonly one source and destination pool combination at step 1004. There aretwo frequencies at step 1006, and the system selects the 1-hourfrequency because it is shorter than the 2-hour frequency. There are twooperations with different retentions at step 1008, and the systemselects the operation with the 8-hour retention, as it has the longerretention value. Instead of one copy being made to satisfy the 4-hourrequirement and another copy being made to satisfy the 8-hourrequirement, the two requirements are coalesced into the longer 8-hourrequirement, and are satisfied by a single snapshot copy operation. Thesystem determines that a copy is due at step 1010, and checks therelevant objects at the History Store 910 to determine if the copy hasalready been made at the target (at step 1012) and at the source (atstep 1014). If these checks are passed, the system initiates the copy atstep 1016, and in the process triggers a snapshot to be made and savedat the snapshot store. The snapshot is then copied from the snapshotstore to the backup store. The system then goes to sleep 1022 and wakesup again after a short period, such as 10 seconds. The result is a copyat the backup store and a copy at the snapshot store, where everyeven-hour snapshot lasts for 8 hours, and every odd-hour snapshot lasts4 hours. The even-hour snapshots at the backup store and the snapshotstore are both tagged with the retention period of 8 hours, and will beautomatically deleted from the system by another process at that time.

Note that there is no reason to take two snapshots or make two backup copies at 2 o'clock, even though both policies apply, because both policies are satisfied by a single copy. Combining and coalescing these snapshots results in the reduction of unneeded operations, while retaining the flexibility of multiple separate policies. As well, it may be helpful to have two policies active at the same time for the same target with different retention. In the example given, there are more hourly copies kept than two-hour copies, resulting in more granularity for restore at times that are closer to the present. For example, in the example system above, if at 7:30 P.M. damage is discovered from earlier in the afternoon, a backup will be available for every hour for the past four hours: 4, 5, 6, 7 P.M. As well, two more backups will have been retained from 2 P.M. and 12 P.M.

The Content Addressable Store

FIG. 11 is a block diagram of the modules implementing the content addressable store for the Content Addressable Provider 610.

The content addressable store 610 implementation provides a storage resource pool that is optimized for capacity rather than for copy-in or copy-out speed, as would be the case for the performance-optimized pool implemented through snapshots, described earlier, and thus is typically used for offline backup, replication and remote backup. Content addressable storage provides a way of storing common subsets of different objects only once, where those common subsets may be of varying sizes but typically as small as 4 KiBytes. The storage overhead of a content addressable store is low compared to a snapshot store, though the access time is usually higher. Generally objects in a content addressable store have no intrinsic relationship to one another, even though they may share a large percentage of their content, though in this implementation a history relationship is also maintained, which is an enabler of various optimizations to be described. This contrasts with a snapshot store where snapshots intrinsically form a chain, each storing just deltas from a previous snapshot or baseline copy. In particular, the content addressable store will store only one copy of a data subset that is repeated multiple times within a single object, whereas a snapshot-based store will store at least one full copy of any object.

The content addressable store 610 is a software module that executes on the same system as the pool manager, either in the same process or in a separate process communicating via a local transport such as TCP. In this embodiment, the content addressable store module runs in a separate process so as to minimize impact of software failures from different components.

This module's purpose is to allow storage of Data Storage Objects 503 in a highly space-efficient manner by deduplicating content (i.e., ensuring repeated content within single or multiple data objects is stored only once).

The content addressable store module provides services to the pool manager via a programmatic API. These services include the following:

Object to Handle mapping 1102: an object can be created by writing data into the store via an API; once the data is written completely the API returns an object handle determined by the content of the object. Conversely, data may be read as a stream of bytes from an offset within an object by providing the handle. Details of how the handle is constructed are explained in connection with the description of FIG. 12.

Temporal Tree Management 1104 tracks parent/child relationships between data objects stored. When a data object is written into the store 610, an API allows it to be linked as a child to a parent object already in the store. This indicates to the content addressable store that the child object is a modification of the parent. A single parent may have multiple children with different modifications, as might be the case for example if an application's data were saved into the store regularly for some while; then an early copy were restored and used as a new starting point for subsequent modifications. Temporal tree management operations and data models are described in more detail below.

Difference Engine 1106 can generate a summary of difference regions between two arbitrary objects in the store. The differencing operation is invoked via an API specifying the handles of two objects to be compared, and the form of the difference summary is a sequence of callbacks with the offset and size of sequential difference sections. The difference is calculated by comparing two hashed representations of the objects in parallel.

Garbage Collector 1108 is a service that analyzes the store to find saved data that is not referenced by any object handle, and to reclaim the storage space committed to this data. It is the nature of the content addressable store that much data is referenced by multiple object handles, i.e., the data is shared between data objects; some data will be referenced by a single object handle; but data that is referenced by no object handles (as might be the case if an object handle has been deleted from the content addressable system) can be safely overwritten by new data.

Object Replicator 1110 is a service to duplicate data objects between two different content addressable stores. Multiple content addressable stores may be used to satisfy additional business requirements, such as offline backup or remote backup.

These services are implemented using the functional modules shown in FIG. 11. The Data Hash module 1112 generates fixed length keys for data chunks up to a fixed size limit. For example, in this embodiment the maximum size of chunk that the hash generator will make a key for is 64 KiB. The fixed length key is either a hash, tagged to indicate the hashing scheme used, or a non-lossy algorithmic encoding. The hashing scheme used in this embodiment is SHA-1, which generates a secure cryptographic hash with a uniform distribution and a probability of hash collision near enough zero that no facility need be incorporated into this system to detect and deal with collisions.

The Data Handle Cache 1114 is a software module managing an in-memory database that provides ephemeral storage for data and for handle-to-data mappings.

The Persistent Handle Management Index 1104 is a reliable persistent database of CAH-to-data mappings. In this embodiment it is implemented as a B-tree, mapping hashes from the hash generator to pages in the persistent data store 1118 that contain the data for this hash. Since the full B-tree cannot be held in memory at one time, for efficiency, this embodiment also uses an in-memory bloom filter to avoid expensive B-tree searches for hashes known not to be present.
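By way of illustration only, the use of an in-memory bloom filter in front of a more expensive index may be sketched in Python as follows. The classes `BloomFilter` and `HandleIndex` are hypothetical; a plain dictionary stands in for the on-disk B-tree, and the filter size and number of hash functions are arbitrary values chosen for the sketch.

```python
import hashlib


class BloomFilter:
    """Small bloom filter: no false negatives, occasional false positives."""

    def __init__(self, bits=1024, hashes=3):
        self.bits = bits                      # assumed to be a multiple of 8
        self.hashes = hashes
        self.array = bytearray(bits // 8)

    def _positions(self, key):
        # Derive several bit positions by salting SHA-1 with a counter byte.
        for i in range(self.hashes):
            digest = hashlib.sha1(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.array[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))


class HandleIndex:
    """Hash-to-page index guarded by a bloom filter."""

    def __init__(self):
        self.filter = BloomFilter()
        self.index = {}            # stand-in for the persistent B-tree
        self.index_lookups = 0     # counts searches that reached the index

    def put(self, digest, page):
        self.filter.add(digest)
        self.index[digest] = page

    def get(self, digest):
        if not self.filter.might_contain(digest):
            return None            # definitely absent: skip the index search
        self.index_lookups += 1
        return self.index.get(digest)
```

A lookup that the filter rejects never touches the index; a lookup that the filter accepts may still miss in the index, since a bloom filter can report false positives.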

The Persistent Data Storage module 1118 stores data and handles to long-term persistent storage, returning a token indicating where the data is stored. The handle/token pair is subsequently used to retrieve the data. As data is written to persistent storage, it passes through a layer of lossless data compression 1120, in this embodiment implemented using zlib, and a layer of optional reversible encryption 1122, which is not enabled in this embodiment.

For example, copying a data object into the content addressable store is an operation provided by the object/handle mapper service, since an incoming object will be stored and a handle will be returned to the requestor. The object/handle mapper reads the incoming object, requests hashes to be generated by the Data Hash Generator, stores the data to Persistent Data Storage and the handle to the Persistent Handle Management Index. The Data Handle Cache is kept updated for future quick lookups of data for the handle. Data stored to Persistent Data Storage is compressed and (optionally) encrypted before being written to disk. Typically a request to copy in a data object will also invoke the temporal tree management service to make a history record for the object, and this is also persisted via Persistent Data Storage.

As another example, copying a data object out of the content addressable store given its handle is another operation provided by the object/handle mapper service. The handle is looked up in the Data Handle Cache to locate the corresponding data; if the data is missing in the cache the persistent index is used; once the data is located on disk, it is retrieved via the persistent data storage module (which decrypts and decompresses the disk data) and then reconstituted to return to the requestor.

The Content Addressable Store Handle

FIG. 12 shows how the handle for a content addressed object is generated. The data object manager references all content addressable objects with a content addressable handle. This handle is made up of three parts. The first part 1201 is the size of the underlying data object the handle immediately points to. The second part 1202 is the depth of the object it points to. The third part 1203 is a hash of the object it points to. Field 1203 optionally includes a tag indicating that the hash is a non-lossy encoding of the underlying data. The tag indicates the encoding scheme used, such as a form of run-length encoding (RLE) of data used as an algorithmic encoding if the data chunk can be fully represented as a short enough RLE. If the underlying data object is too large to be represented as a non-lossy encoding, a mapping from the hash to a pointer or reference to the data is stored separately in the persistent handle management index 1104.

The data for a content addressable object is broken up into chunks 1204. The size of each chunk must be addressable by one content addressable handle 1205. The data is hashed by the data hash module 1112, and the hash of the chunk is used to make the handle. If the data of the object fits in one chunk, then the handle created is the final handle of the object. If not, then the handles themselves are grouped together into chunks 1206 and a hash is generated for each group of handles. This grouping of handles continues 1207 until there is only one handle 1208 produced, which is then the handle for the object.

When an object is to be reconstituted from a content handle (the copy-out operation for the storage resource pool), the top level content handle is dereferenced to obtain a list of next-level content handles. These are dereferenced in turn to obtain further lists of content handles until depth-0 handles are obtained. These are expanded to data, either by looking up the handle in the handle management index or cache, or (in the case of an algorithmic hash such as run-length encoding) expanding deterministically to the full content.
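By way of illustration only, the construction of a handle by repeated grouping, and the reconstitution of an object from its top level handle, may be sketched in Python as follows. The names `Handle`, `put` and `get`, the 4-byte chunk size and the grouping factor of two are assumptions chosen to keep the example small; the optional non-lossy encoding tag of field 1203 is omitted, and a dictionary stands in for the index and persistent store.

```python
import hashlib
from typing import NamedTuple

CHUNK_SIZE = 4   # bytes per depth-0 chunk (tiny, for illustration only)
FANOUT = 2       # handles grouped into each higher-level chunk (assumed)


class Handle(NamedTuple):
    size: int     # size of the data the handle covers (part 1201)
    depth: int    # 0 for data, higher for lists of handles (part 1202)
    digest: str   # hash of what the handle points to (part 1203)


def put(store, data):
    """Store `data` in the dict `store` and return its single top-level handle."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)] or [b""]
    level = []
    for chunk in chunks:
        handle = Handle(len(chunk), 0, hashlib.sha1(chunk).hexdigest())
        store[(0, handle.digest)] = chunk
        level.append(handle)
    depth = 0
    while len(level) > 1:               # keep grouping until one handle remains
        depth += 1
        grouped = []
        for i in range(0, len(level), FANOUT):
            group = level[i:i + FANOUT]
            joined = "".join(h.digest for h in group).encode()
            digest = hashlib.sha1(joined).hexdigest()
            store[(depth, digest)] = group
            grouped.append(Handle(sum(h.size for h in group), depth, digest))
        level = grouped
    return level[0]


def get(store, handle):
    """Reconstitute the object by dereferencing handles down to depth 0."""
    target = store[(handle.depth, handle.digest)]
    if handle.depth == 0:
        return target
    return b"".join(get(store, child) for child in target)
```

Because every handle is derived only from content, writing the same bytes twice in this sketch yields the same top level handle and adds nothing new to the store.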

Temporal Tree Management

FIG. 13 illustrates the temporal tree relationship created for data objects stored within the content addressable store. This particular data structure is utilized only within the content addressable store. The temporal tree management module maintains data structures 1302 in the persistent store that associate each content-addressed data object to a parent (which may be null, to indicate the first in a sequence of revisions). The individual nodes of the tree contain a single hash value. This hash value references a chunk of data, if the hash is a depth-0 hash, or a list of other hashes, if the hash is a depth-1 or higher hash. The references mapped to a hash value are contained in the Persistent Handle Management Index 1104. In some embodiments the edges of the tree may have weights or lengths, which may be used in an algorithm for finding neighbors.

This is a standard tree data structure and the module supports standard manipulation operations, in particular: 1310 Add: adding a leaf below a parent, which results in a change to the tree as between initial state 1302 and after-add state 1304; and 1312 Remove: removing a node (and reparenting its children to its parent), which results in a change to the tree as between after-add state 1304 and after-remove state 1306.
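By way of illustration only, the Add and Remove operations may be sketched in Python as follows. The class `TemporalTree` and its attribute names are hypothetical, and plain dictionaries stand in for the data structures kept in the persistent store.

```python
class TemporalTree:
    """Parent/child history relationships between stored objects."""

    def __init__(self):
        self.parent = {}     # node -> parent (None for a first revision)
        self.children = {}   # node -> list of child nodes

    def add(self, node, parent=None):
        """Add `node` as a leaf below `parent`."""
        if node in self.parent:
            raise ValueError(f"{node!r} is already in the tree")
        if parent is not None and parent not in self.parent:
            raise KeyError(f"unknown parent {parent!r}")
        self.parent[node] = parent
        self.children[node] = []
        if parent is not None:
            self.children[parent].append(node)

    def remove(self, node):
        """Remove `node`, reparenting its children to its parent."""
        parent = self.parent.pop(node)
        orphans = self.children.pop(node)
        if parent is not None:
            siblings = self.children[parent]
            siblings.remove(node)
            siblings.extend(orphans)
        for child in orphans:
            self.parent[child] = parent
```

For example, after adding T1, then T2 and T3 as children of T1, then T4 as a child of T2, removing T2 in this sketch leaves T1 with the children T3 and T4.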

The “Add” operation may be used whenever an object is copied-in to the CAS from an external pool. If the copy-in is via the Optimal Way for Data Backup, or if the object is originating in a different CAS pool, then it is required that a predecessor object be specified, and the Add operation is invoked to record this predecessor/successor relationship.

The “Remove” operation is invoked by the object manager when the policy manager determines that an object's retention period has expired. This may lead to data stored in the CAS having no object in the temporal tree referring to it, and therefore a subsequent garbage collection pass can free up the storage space for that data as available for re-use.

Note that it is possible for a single predecessor to have multiple successors or child nodes. For example, this may occur if an object is originally created at time T1 and modified at time T2, the modifications are rolled back via a restore operation, and subsequent modifications are made at time T3. In this example, state T1 has two children, state T2 and state T3.

Different CAS pools may be used to accomplish different business objectives such as providing disaster recovery in a remote location. When copying from one CAS to another CAS, the copy may be sent as hashes and offsets, to take advantage of the native deduplication capabilities of the target CAS. The underlying data pointed to by any new hashes is also sent on an as-needed basis.

The temporal tree structure is read or navigated as part of the implementation of various services:

Garbage Collection navigates the tree in order to reduce the cost of the “mark” phase, as described below.

Replication to a different CAS pool finds a set of near-neighbors in the temporal tree that are also known to have been transferred already to the other CAS pool, so that only a small set of differences need to be transferred additionally.

Optimal-Way for data restore uses the temporal tree to find a predecessor that can be used as a basis for the restore operation. In the CAS temporal tree data structure, children are subsequent versions, e.g., as dictated by archive policy. Multiple children are supported on the same parent node; this case may arise when a parent node is changed, then used as the basis for a restore, and subsequently changed again.

CAS Difference Engine

The CAS difference engine 1106 compares two objects identified by hash values or handles as in FIGS. 11 and 12, and produces a sequence of offsets and extents within the objects where the object data is known to differ. This sequence is achieved by traversing the two object trees in parallel in the hash data structure of FIG. 12. The tree traversal is a standard depth- or breadth-first traversal. During traversal, the hashes at the current depth are compared. Where the hash of a node is identical between both sides, there is no need to descend the tree further, so the traversal may be pruned. If the hash of a node is not identical, the traversal continues descending into the next lowest level of the tree. If the traversal reaches a depth-0 hash that is not identical to its counterpart, then the absolute offset into the data object being compared where the non-identical data occurs, together with the data length, is emitted into the output sequence. If one object is smaller in size than another, then its traversal will complete earlier, and all subsequent offsets encountered in the traversal of the other are emitted as differences.
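By way of illustration only, a parallel depth-first traversal with pruning may be sketched in Python as follows. The names `Node`, `build` and `diff`, the 4-byte chunk size and the grouping factor of two are hypothetical. For simplicity the sketch assumes that both hash trees have the same depth; objects whose trees differ in depth would need additional alignment logic that is not shown.

```python
import hashlib
from dataclasses import dataclass, field
from itertools import zip_longest


@dataclass
class Node:
    digest: str
    size: int
    children: list = field(default_factory=list)   # empty for depth-0 nodes


def build(data, chunk=4, fanout=2):
    """Build a small hash tree over `data` (sizes are illustrative only)."""
    level = [Node(hashlib.sha1(data[i:i + chunk]).hexdigest(), len(data[i:i + chunk]))
             for i in range(0, len(data), chunk)]
    if not level:
        level = [Node(hashlib.sha1(b"").hexdigest(), 0)]
    while len(level) > 1:
        groups = [level[i:i + fanout] for i in range(0, len(level), fanout)]
        level = [Node(hashlib.sha1("".join(n.digest for n in g).encode()).hexdigest(),
                      sum(n.size for n in g), g) for g in groups]
    return level[0]


def diff(a, b, offset=0):
    """Yield (offset, length) extents where the two objects differ."""
    if a is None or b is None:              # one object ended earlier
        present = a if a is not None else b
        yield (offset, present.size)
        return
    if a.digest == b.digest:                # identical subtree: prune
        return
    if not a.children and not b.children:   # differing depth-0 chunks
        yield (offset, max(a.size, b.size))
        return
    if not a.children or not b.children:
        raise ValueError("this sketch assumes both trees have the same depth")
    for ca, cb in zip_longest(a.children, b.children):
        yield from diff(ca, cb, offset)
        offset += max(n.size for n in (ca, cb) if n is not None)


old = build(b"AAAABBBBCCCCDDDD")
new = build(b"AAAAXXXXCCCCDDDD")
print(list(diff(old, new)))   # [(4, 4)]
```

In the usage shown, the subtree covering the unchanged second half of the object is never descended into, because its hash matches on both sides.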

Garbage Collection Via Differencing

As described under FIG. 11, Garbage Collector is a service that analyzes a particular CAS store to find saved data that is not referenced by any object handle in the CAS store temporal data structure, and to reclaim the storage space committed to this data. Garbage collection uses a standard “Mark and Sweep” approach. Since the “mark” phase may be quite expensive, the algorithm used for the mark phase attempts to minimize marking the same data multiple times, even though it may be referenced many times; however the mark phase must be complete, ensuring that no referenced data is left unmarked, as this would result in data loss from the store as, after a sweep phase, unmarked data would later be overwritten by new data.

The algorithm employed for marking referenced data uses the fact that objects in the CAS are arranged in graphs with temporal relationships using the data structure depicted in FIG. 13. It is likely that objects that share an edge in these graphs differ in only a small subset of their data, and it is also rare that any new data chunk that appears when an object is created from a predecessor should appear again between any two other objects. Thus, the mark phase of garbage collection processes each connected component of the temporal graph.

FIG. 14 is an example of garbage collection using temporal relationships in certain embodiments. A depth-first search is made, represented by arrows 1402, of a data structure containing temporal relationships. Take a starting node 1404 from which to begin the tree traversal. Node 1404 is the tree root and references no objects. Node 1406 contains references to objects H1 and H2, denoting a hash value for object 1 and a hash value for object 2. All depth-0, depth-1 and higher data objects that are referenced by node 1406, here H1 and H2, are enumerated and marked as referenced.

Next, node 1408 is processed. As it shares an edge with node 1406, which has been marked, the difference engine is applied to the difference between the object referenced by 1406 and the object referenced by 1408, obtaining a set of depth-0, depth-1 and higher hashes that exist in the unmarked object but not in the marked object. In the figure, the hash that exists in node 1408 but not in node 1406 is H3, so H3 is marked as referenced. This procedure is continued until all edges are exhausted.

A comparison of the results produced by a prior art algorithm 1418 and the present embodiment 1420 shows that when node 1408 is processed by the prior art algorithm, previously-seen hashes H1 and H2 are emitted into the output stream along with new hash H3. Present embodiment 1420 does not emit previously seen hashes into the output stream, resulting in only new hashes H3, H4, H5, H6, H7 being emitted into the output stream, with a corresponding improvement in performance. Note that this method does not guarantee that data will not be marked more than once. For example, if hash value H4 occurs independently in node 1416, it will be independently marked a second time.
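By way of illustration only, the mark phase may be sketched in Python as follows. The function `mark` and the small tree used here are hypothetical and are not the tree of FIG. 14; each object is modeled as the set of all hashes it references, and a set difference stands in for the difference engine.

```python
def mark(tree, hashes, root):
    """Depth-first mark that emits only hashes absent from the parent object.

    tree:   {node: [child, ...]} temporal relationships
    hashes: {node: set of hashes referenced by that node's object}
    Returns (marked set, list of hashes emitted in traversal order).
    """
    marked = set()
    emitted = []
    stack = [(root, None)]
    while stack:
        node, parent = stack.pop()
        own = hashes.get(node, set())
        base = hashes.get(parent, set()) if parent is not None else set()
        new = own - base                  # stand-in for the difference engine
        emitted.extend(sorted(new))
        marked |= new
        for child in reversed(tree.get(node, [])):
            stack.append((child, node))
    return marked, emitted


# Hypothetical example: the root references no objects.
tree = {"root": ["A"], "A": ["B", "C"], "B": ["D"]}
hashes = {
    "A": {"H1", "H2"},
    "B": {"H1", "H2", "H3"},
    "C": {"H1", "H4"},
    "D": {"H3", "H4", "H5"},
}
marked, emitted = mark(tree, hashes, "root")
print(emitted)   # ['H1', 'H2', 'H3', 'H4', 'H5', 'H4']
```

In this example six hashes are emitted where enumerating every object in full would emit ten, and H4 is emitted twice because it appears independently on two branches, consistent with the note above that marking more than once is not ruled out.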

Copy an Object into the CAS

Copying an object from another pool into the CAS uses the software modules described in FIG. 11 to produce a data structure referenced by an object handle as in FIG. 12. The input to the process is (a) a sequence of chunks of data at specified offsets, sized appropriately for making depth-0 handles, and optionally (b) a previous version of the same object. Implicitly, the new object will be identical to the previous version except where the input data is provided and itself differs from the previous version. The algorithm for the copy-in operation is illustrated in a flowchart at FIG. 15.

If a previous version (b) is provided, then the sequence (a) may be a sparse set of changes from (b). In the case that the object to be copied is known to differ from a previous object at only a few points, this can greatly reduce the amount of data that needs to be copied in, and therefore reduce the computation and I/O activity required. This is the case, for example, when the object is to be copied in via the optimal way for data backup described previously.

Even if the sequence (a) includes sections that are largely unchanged from a predecessor, identifying the predecessor (b) allows the copy-in procedure to do quick checks as to whether the data has indeed changed and therefore to avoid data duplication at a finer level of granularity than might be possible for the difference engine in some other storage pool providing input to a CAS.

The process starts as an arbitrarily-sized data object in the temporal store is provided, and proceeds to 1502, which enumerates any and all hashes (depth-0 through the highest level) referenced by the hash value in the predecessor object, if such is provided. This will be used as a quick check to avoid storing data that is already contained in the predecessor.

At step 1504, if a predecessor is input, create a reference to a clone of it in the content-addressable data store temporal data structure. This clone will be updated to become the new object. Thus the new object will become a copy of the predecessor modified by the differences copied into the CAS from the copying source pool.

At steps 1506, 1508, the Data Mover 502 pushes the data into the CAS. The data is accompanied by an object reference and an offset, which is the target location for the data. The data may be sparse, as only the differences from the predecessor need to be moved into the new object. At this point the incoming data is broken into depth-0 chunks sized small enough that each can be represented by a single depth-0 hash.

At step 1510, the data hash module generates a hash for each depth-0 chunk.

At step 1512, read the predecessor hash at the same offset. If the hash of the data matches the hash of the predecessor at the same offset, then no data needs to be stored and the depth-1 and higher objects do not need to be updated for this depth-0 chunk. In this case, return to accept the next depth-0 chunk of data. This achieves temporal deduplication without having to do expensive global lookups. Even though the source system is ideally sending only the differences from the data that has previously been stored in the CAS, this check may be necessary if the source system is performing differencing at a different level of granularity, or if the data is marked as changed but has been changed back to its previously-stored value. Differencing may be performed at a different level of granularity if, for example, the source system is a snapshot pool which creates deltas on a 32 KiB boundary and the CAS store creates hashes on 4 KiB chunks.
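By way of illustration only, the predecessor check of step 1512 may be sketched in Python as follows. The function `copy_in` is hypothetical, and the sketch is deliberately flattened: an object is modeled as a mapping from offset to depth-0 hash, a dictionary stands in for the global index and persistent store, and the depth-1 and higher objects and the local cache of the flowchart are omitted.

```python
import hashlib


def copy_in(store, predecessor, changes):
    """Copy sparse changes into the store on top of a predecessor.

    store:       {digest: chunk bytes}, stand-in for index plus persistent store
    predecessor: {offset: digest} for the previous version (may be empty)
    changes:     iterable of (offset, chunk bytes), already sized as depth-0 chunks
    Returns (new object as {offset: digest}, counters describing what happened).
    """
    new_object = dict(predecessor)               # start from a clone (step 1504)
    stats = {"unchanged": 0, "deduplicated": 0, "stored": 0}
    for offset, chunk in changes:
        digest = hashlib.sha1(chunk).hexdigest()  # step 1510
        if predecessor.get(offset) == digest:     # step 1512: same data, same offset
            stats["unchanged"] += 1
            continue
        if digest in store:                       # global lookup finds existing data
            stats["deduplicated"] += 1
        else:
            store[digest] = chunk
            stats["stored"] += 1
        new_object[offset] = digest
    return new_object, stats


store = {}
base, _ = copy_in(store, {}, [(0, b"AAAA"), (4, b"BBBB")])
new, stats = copy_in(store, base, [(0, b"AAAA"), (4, b"CCCC"), (8, b"BBBB")])
print(stats)   # {'unchanged': 1, 'deduplicated': 1, 'stored': 1}
```

In the usage shown, the chunk at offset 0 is skipped by the inexpensive predecessor comparison alone, while the chunk at offset 8 is only recognized as already stored by the more expensive global lookup.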

If a match is not found, the data may be hashed and stored. Data is written starting at the provided offset and ending once the new data has been exhausted. Once the data has been stored, at step 1516, if the offset is still contained within the same depth-1 object, then depth-1, depth-2 and all higher objects 1518 are updated, generating new hashes at each level, and the depth-0, depth-1 and all higher objects are stored at step 1514 to a local cache.

However, at step 1520, if the amount of data to be stored exceeds the depth-1 chunk size and the offset is to be contained in a new depth-1 object, the current depth-1 must be flushed to the store, unless it is determined to be stored there already. First look it up in the global index 1116. If it is found there, remove the depth-1 and all associated depth-0 objects from the local cache and proceed with the new chunk 1522.

At step 1524, as a quick check to avoid visiting the global index, for each depth-0, depth-1 and higher object in the local cache, look up its hash in the local store established in 1502. Discard any that match.

At step 1526, for each depth-0, depth-1 and higher object in the local cache, look up its hash in the global index 1116. Discard any that match. This ensures that data is deduplicated globally.

At step 1528, store all remaining content from the local cache into the persistent store, then continue to process the new chunk.

Reading an object out of the CAS is a simpler process and is common across many implementations of CAS. The handle for the object is mapped to a persistent data object via the global index, and the offset required is read from within this persistent data. In some cases it may be necessary to recurse through several depths in the object handle tree.

CAS Object Network Replication

As described under FIG. 11, the Replicator 1110 is a service to duplicate data objects between two different content addressable stores. The process of replication could be achieved through reading out of one store and writing back into another, but this architecture allows more efficient replication over a limited bandwidth connection such as a local- or wide-area network.

A replicating system operating on each CAS store uses the difference engine service described above together with the temporal relationship structure as described in FIG. 13, and additionally stores on a per-object basis in the temporal data structure used by the CAS store a record of what remote store the object has been replicated to. This provides definitive knowledge of object presence at a certain data store.

Using the temporal data structure, it is possible for the system to determine which objects exist on which data stores. This information is leveraged by the Data Mover and Difference Engine to determine a minimal subset of data to be sent over the network during a copy operation to bring a target data store up to date. For example, if data object O has been copied at time T3 from a server in Boston to a remote server in Seattle, Protection Catalog Store 908 will record that object O at time T3 exists both in Boston and Seattle. At time T5, during a subsequent copy from Boston to Seattle, the temporal data structure will be consulted to determine the previous state of object O in Seattle that should be used for differencing on the source server in Boston. The Boston server will then take the difference of T5 and T3, and send that difference to the Seattle server.

The process to replicate an object A is then as follows: Identify an object A0 that is recorded as having already been replicated to the target store and a near neighbor of A in the local store. If no such object A0 exists then send A to the remote store and record it locally as having been sent. To send a local object to the remote store, a typical method as embodied here is: send all the hashes and offsets of data chunks within the object; query the remote store as to which hashes represent data that is not present remotely; send the required data to the remote store (sending the data and hashes is implemented in this embodiment by encapsulating them in a TCP data stream).

Conversely, if A0 is identified, then run the difference engine to identify data chunks that are in A but not in A0. This should be a superset of the data that needs to be sent to the remote store. Send hashes and offsets for chunks that are in A but not in A0. Query the remote store as to which hashes represent data that is not present remotely; send the required data to the remote store.
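By way of illustration only, the two replication paths may be sketched in Python as follows. The function `replicate` is hypothetical; objects are modeled as mappings from offset to hash, dictionaries stand in for the local and remote stores, a per-offset comparison stands in for the difference engine, and the TCP transport and the record of what has been replicated are omitted.

```python
def replicate(obj, local_store, remote_store, baseline=None):
    """Replicate `obj` to the remote store, optionally relative to `baseline`.

    obj, baseline: {offset: digest}; baseline is an object already known
                   to be present on the remote store (A0 in the text above)
    Returns (sorted digests whose data was sent, number of hashes offered).
    """
    if baseline is None:
        offered = dict(obj)                      # no neighbor: offer every hash
    else:
        offered = {off: d for off, d in obj.items() if baseline.get(off) != d}
    # The remote store answers which of the offered hashes it does not hold.
    missing = {d for d in offered.values() if d not in remote_store}
    for digest in missing:
        remote_store[digest] = local_store[digest]
    return sorted(missing), len(offered)


local_store = {"h1": b"a", "h2": b"b", "h3": b"c"}
remote_store = {}
a0 = {0: "h1", 4: "h2"}
a = {0: "h1", 4: "h3", 8: "h2"}

print(replicate(a0, local_store, remote_store))               # (['h1', 'h2'], 2)
print(replicate(a, local_store, remote_store, baseline=a0))   # (['h3'], 2)
```

In the second call two hashes are offered but only the data for one is sent, which illustrates why the difference between A and A0 is a superset of what must actually be transferred.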

Sample Deployment Architecture

FIG. 16 shows the software and hardware components in one embodiment of the Data Management Virtualization (DMV) system. The software in the system executes as three distributed components:

The Host Agent software 1602 a, 1602 b, 1602 c implements some of the application-specific modules described above. It executes on the same servers 1610 a, 1610 b, 1610 c as the application whose data is under management.

The DMV server software 1604 a, 1604 b implements the remainder of the system as described here. It runs on a set of Linux servers 1612, 1614 that also provide highly available virtualized storage services.

The system is controlled by Management Client software 1606 that runs on a desktop or laptop computer 1620.

These software components communicate with one another via network connections over an IP network 1608. Data Management Virtualization systems communicate with one another between primary site 1622 and data replication (DR) site 1624 over an IP network such as a public internet backbone.

The DMV systems at primary and DR sites access one or more SAN storage systems 1616, 1618 via a fibre-channel network 1626. The servers running primary applications access the storage virtualized by the DMV systems via fibre-channel over the fibre-channel network, or iSCSI over the IP network. The DMV system at the remote DR site runs a parallel instance of DMV server software 1604 c on Linux server 1628. Linux server 1628 may also be an Amazon Web Services EC2 instance or other similar cloud computational resource.

FIG. 17 is a diagram that depicts the various components of a computerized system upon which certain elements may be implemented, according to certain embodiments of the invention. The logical modules described may be implemented on a host computer 1701 that contains volatile memory 1702, a persistent storage device such as a hard drive 1708, a processor 1703, and a network interface 1704. Using the network interface, the system computer can interact with storage pools 1705, 1706 over a SAN or Fibre Channel device, among other embodiments. Although FIG. 17 illustrates a system in which the system computer is separate from the various storage pools, some or all of the storage pools may be housed within the host computer, eliminating the need for a network interface. The programmatic processes may be executed on a single host, as shown in FIG. 17, or they may be distributed across multiple hosts.

The host computer shown in FIG. 17 may serve as an administrative workstation, or may implement the application and Application Specific Agent 402, or may implement any and all logical modules described in this specification, including the Data Virtualization System itself, or may serve as a storage controller for exposing storage pools of physical media to the system. Workstations may be connected to a graphical display device 1707, and to input devices such as a mouse 1709 and a keyboard 1710. Alternately, the active user's workstation may include a handheld device.

Throughout this specification we refer to software components, but all references to software components are intended to apply to software running on hardware. Likewise, objects and data structures referred to in the specification are intended to apply to data structures actually stored in memory, either volatile or non-volatile. Likewise, servers are intended to apply to software, and engines are intended to apply to software, all running on hardware such as the computer systems described in FIG. 17.

Data Fingerprint for Copy Accuracy Assurance

FIG. 18 illustrates a method for generating a data fingerprint for an object stored in a virtual storage pool, according to certain embodiments of the invention.

A data fingerprint is a short binary digest of a data object that may be generated independently regardless of how the data object is stored, and is identical when generated multiple times against identical input data with identical parameters. Useful properties for the fingerprint are that it be of fixed size, that it be fast to generate for data objects in all storage pools, and that it be unlikely that different data objects have identical fingerprints.

A data fingerprint is different from a checksum or a hash. For example, a fingerprint is taken for only a sample of the object, not the whole object. Obtaining a binary digest of a small percentage of the data object is sufficient to provide a fingerprint for the whole data object. Since a data fingerprint only requires reads and computes on a small percentage of data, such fingerprints are computationally cheap or efficient compared to a checksum or hash.

These data fingerprints are also different in that a single data object may have multiple fingerprints. Over the life of a data object, multiple fingerprints are stored with the object as metadata, one per generation of the data object. The multiple fingerprints persist over multiple copies and generations of the data object.

Data fingerprints may be used to compare two objects to determine whether they are the same data object. If the data fingerprints for two objects differ, the two objects can definitively be said to be different. As with checksums, data fingerprints may thus be used to provide a measure or test of data integrity between copied or stored versions of a data object. Two data objects with the same data fingerprint may not necessarily be the same object.

As multiple fingerprints are taken of an object, data fingerprints may be used to compare two objects with increasing reliability. A fingerprint match on a subsequent revision increases confidence that all the previous copies were accurate. If a fingerprint does not match, this indicates that either this copy or previous copies were not accurate. With each next generation of the copy, a new fingerprint may be computed and validated against the corresponding fingerprint for that generation or revision.

If two data objects are compared by comparing their corresponding data fingerprints, and the corresponding fingerprints do not match, it is possible to conclude with certainty that the two data objects are different. However, if the corresponding fingerprints do match, it is not possible to conclude that the corresponding data objects are necessarily identical. For example, given two data objects that represent a digital photograph or image data, taking a data fingerprint may include taking a checksum or binary digest of a portion of each image. Comparing the two data objects based on a single portion of each image would not necessarily indicate that they are the same image. However, if multiple portions of the two images are identical, it is possible to conclude with increased certainty that the two images are the same image.

The calculation of a data fingerprint may require a selection function, which may be dynamic, that selects a subset or portion of the input data object. Any such function may be used; one specific example is described below in connection with certain embodiments. The function may select small portions of the data object that are spread out throughout the entirety of the data object. This strategy for selecting portions of data is useful for typical storage workloads, in which large chunks of data are often modified at one time; by selecting a relatively large number of non-contiguous portions or extents of data that are widely distributed within the data object, the selection function increases the probability that a large contiguous change in the data object may be detected. The function may change over time or may base its output on various inputs or parameters.

The choice of a selection function should ideally be made with an awareness of the content of a data object. Portions of the data object that are likely to change from generation to generation should be included in the fingerprint computation. Portions of the data object that are static, or tend to be identical for similar objects, should not be included in the fingerprint. For example, disk labels and partition tables, which tend to be static, should not generally be included in the fingerprint, since these would match across many generations of the same object. The tail end of a volume containing filesystems often tends to be unused space; this area should not be used in the computation of the fingerprint, as it will add computational and IO cost to the fingerprint without increasing its discriminating value.

It is apparent that as the total size of the subset selected by the selection function increases, the probability that the data fingerprint captures all changes to the data increases, until the subset is equal to the whole data object, at which time the probability is 1. However, the selection function may balance the goal of increased probability of detecting changes with the goal of providing a consistently-fast fingerprinting time. This tradeoff is expressly permitted, as the disclosed system allows for multiple data fingerprints to be taken of the same data object. Multiple fingerprints can provide the increased error-checking probability as well, as when the number of fingerprints becomes large, the number of un-checked bytes in the data object decreases to zero.

A data fingerprinting function may operate as follows, in some embodiments. A data object, 1810, is any file stored within any virtual storage pool, for example a disk image stored as part of a data protection or archiving workflow. Start, 1820, is a number representing an offset or location within the file. Period, 1830, is a number representing a distance between offsets within the file. Data Sample, 1840, is a subset of data from within the data object. Chunk checksums, 1850, are the result of specific arithmetic checksum operations applied to specific data within the file. The data fingerprint, 1860, is a single numerical value derived deterministically from the content of the data object 1810 and the parameters start 1820 and period 1830. Other parameters and other parametrized functions may be used in certain embodiments.

The data samples 1840 are broken into fixed length chunks, in this illustration 4 KB. For each chunk a chunk checksum 1850 is calculated for the data stream, where the checksum includes the data in the chunk and the SHA-1 hash of the data in the chunk. One checksum algorithm used is the Fletcher-32 method (http://en.wikipedia.org/wiki/Fletcher's_checksum). These chunk checksums are then added together modulo 2⁶⁴, and the arithmetic sum of the chunk checksums is the data fingerprint 1860, parameterized by Start and Period. Other methods for combining the plurality of hash values or checksums into a single hash value may be contemplated in certain embodiments of the invention. A single hash value is preferred for simplicity. It is not necessary for the single hash value to reveal which data subsets were used in producing the chunk checksums.

In other embodiments, a data fingerprint may be performed using other functions that focus on interesting sections of a data object, where certain sections are determined to be interesting using various means. Interesting sections may be sections that are determined to change frequently, or that are likely to change frequently. A priori information about the content of the data object or the frequency of change of parts of the data object may be used. For example, when the system detects that a data object is a disk image, the system may ignore the volume partition map, as the partition map rarely changes. As another example, if the system knows that it is storing a Microsoft Word document, and that the headers of the document are unlikely to change, it may designate the body and text areas of the document as “interesting,” and may choose to fingerprint those areas. Fingerprinting an “interesting” area may be performed in a manner similar to FIG. 18, in some embodiments, where the data samples are chosen by first identifying interesting data areas and then identifying areas to sample within the interesting data areas using an algorithm that generates a sparse subset of the interesting data areas.

In a preferred embodiment, the described fingerprinting algorithm has a very small overhead, and thus fingerprinting may be performed often. However, in cases such as when a pool includes offline tapes, fingerprinting all data may not have a reasonable overhead.

FIG. 19 illustrates how the data fingerprint is used for assurance of accuracy in copy operations, according to certain embodiments of the invention.

In addition to the operations described above for the Object Manager 501, an additional operation is defined: that of generating a fingerprint for a data object, given a set of parameters (operation 1930). Every data object that is cataloged is fingerprinted and the fingerprint is stored with all other metadata.

When an object is cataloged, Object Manager 501 may make a request for a fingerprint on a data object to each pool. The first fingerprint is generated at the first storage-optimized pool or snapshot pool and stored in the catalog store. After a data object is first copied into the Performance Optimized Pool 508 using the lightweight snapshot operation, the data movement requestor 912 generates a set of parameters for a fingerprint, and uses them to request a fingerprint (operation 1910) from the object manager 501. In turn the object manager requests a fingerprint from the performance optimized pool (operation 1940). The performance optimized pool is capable of generating the fingerprint. In a preferred embodiment, every pool managed by pool manager 504 is capable of generating a fingerprint. The new fingerprint is stored into the protection catalog store 908, along with other metadata for the object as described above (operation 1930).

After any subsequent copy request (operation 1910), such as copy to capacity optimized pool (operation 1950), the fingerprint is requested from the target pool for the target object (operation 1930, operation 1960). Once generated, the stored fingerprint is then passed on to each subsequent pool, where the newly calculated fingerprint is then verified against the stored fingerprint to assure that copying errors have not occurred. Each subsequent pool may calculate the fingerprint again and validate the calculated fingerprint against the stored fingerprint.

To generate a fingerprint, the data object 1810 is sampled at regular intervals defined by Start 1820 and Period 1830 parameters. Each sample is a fixed size, in this illustration 64 KB. In one embodiment, the parameter Period is chosen such that it is approximately 1/1000 of the size of the data object, and Start is chosen between 0 and Period according to a pseudo-random number generator.
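The sampling and checksum steps described for FIG. 18 can be sketched as follows. This is a minimal illustration and not the disclosed implementation: the function names are hypothetical, the object is held in memory rather than read from a storage pool, the chunk checksum is assumed to be Fletcher-32 computed over the chunk followed by its SHA-1 digest (one possible reading of "includes the data in the chunk and the SHA-1 hash"), and the lower bound on Period is an added assumption that keeps samples from overlapping on small objects.

```python
import hashlib
import random

SAMPLE_SIZE = 64 * 1024   # sample length used in the illustration
CHUNK_SIZE = 4 * 1024     # chunk length used in the illustration
MASK64 = (1 << 64) - 1    # chunk checksums are summed modulo 2**64


def fletcher32(data: bytes) -> int:
    """Fletcher-32 over 16-bit little-endian words (odd input is zero-padded)."""
    if len(data) % 2:
        data += b"\x00"
    sum1 = sum2 = 0
    for i in range(0, len(data), 2):
        word = data[i] | (data[i + 1] << 8)
        sum1 = (sum1 + word) % 65535
        sum2 = (sum2 + sum1) % 65535
    return (sum2 << 16) | sum1


def choose_parameters(size: int, rng: random.Random) -> tuple[int, int]:
    """Period is about 1/1000 of the object; Start is drawn from [0, Period)."""
    # Assumption: never let Period fall below one sample length.
    period = max(size // 1000, SAMPLE_SIZE)
    start = rng.randrange(period)
    return start, period


def fingerprint(data: bytes, start: int, period: int) -> int:
    """Sum of chunk checksums over samples taken at start, start+period, ..."""
    total = 0
    for offset in range(start, len(data), period):
        sample = data[offset:offset + SAMPLE_SIZE]
        for c in range(0, len(sample), CHUNK_SIZE):
            chunk = sample[c:c + CHUNK_SIZE]
            # Assumed reading: checksum the chunk followed by its SHA-1 digest.
            digest = hashlib.sha1(chunk).digest()
            total = (total + fletcher32(chunk + digest)) & MASK64
    return total
```

Two pools that hold the same data and are given the same Start and Period produce the same value, so only the parameters and the resulting number need to travel with the object's metadata.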

For each new revision or generation of the data object, the start parameter may be modified, resulting in a data fingerprint of a different region of the data object. The object size, however, changes only in certain circumstances. If the object size stays constant the period stays constant. If the object size changes the period will change as well. A period of 1/1000 (0.001) or another small fraction may be selected to ensure that calculating a fingerprint will take a small time and/or a constant time. Note that depending on the function used to generate the subset of the data object used for the data fingerprinting operation, other parameters may be modified instead of the start parameter. The result is to cause the data fingerprint to be generated from a different region of the data object, such that cumulative data fingerprints result in fingerprinting of an increasing proportion of the data object over time.

Multiple generations of a data object may be created as a result of interactions with service level agreements (SLAs), as described elsewhere in the present disclosure. For example, given a SLA that schedules a snapshot operation once every hour, an additional generation of a data object will be created every hour. For each additional generation, a new data fingerprint is created and sent. If the data object has not changed from the previous generation to the current generation, the data itself need not be sent, but a fingerprint is sent to the target data pool regardless, to incrementally increase the probability that the sparse data fingerprinting operation has captured all changes to the data throughout the data object.

As different storage pools may support different operations, the fingerprint operation may be supported by one or more storage pools in the system. The pools are brokered by the operation manager such as Pool Request Broker 602. In a preferred embodiment the fingerprint operation is supported by all pools.

Fingerprinting remains with the metadata for the lifetime of the data object. This allows fingerprinting to be used during restore as well as during copy or other phases of data storage, access and recovery, which provides true end-to-end metadata from a data perspective. Fingerprinting during restore is performed as follows. When a restore operation is requested by Object Manager 501, a fingerprint operation may take place on the restored data. This fingerprint operation may take place before or after the restore operation. By using the fingerprint operation, all previously-stored revisions of the data object are used to verify the currently-restored copy of the data, according to the fingerprint verification method described above. This leverages incremental knowledge in a way different from that of typical I/O path CRC protection.

As disclosed above, each copy of an object between virtual storage pools is incremental, transferring only data from the source object known to be absent in the target pool. It follows from this that any errors in copying in one generation of an object will still be present in subsequent generations. Indeed such errors may be compounded. The use of a data fingerprint provides a check that copies of an object in different virtual storage pools have the same data content.

The choice of data fingerprint method also controls the level of confidence in the check: as the Period (1830) is made smaller, the cost of generating the fingerprint goes up, as more data needs to be read from the pool, but the chance of generating a matching fingerprint despite the data containing copying errors decreases.

However, for successive generations of a single object, different values may be used for the parameter Start (1820). This ensures that with repeated copying of successive generations of a single object, the chance that any copying error might not eventually be caught reduces asymptotically to zero.
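The asymptotic behavior can be illustrated with a short calculation. The model below is an assumption made for illustration only and is not stated in the disclosure: Start is drawn independently and uniformly from [0, Period) for each generation, a copying error occupies one contiguous extent that persists across generations, any sample that overlaps the extent changes the fingerprint, and effects at the ends of the object are ignored. The function name and its parameters are hypothetical.

```python
def miss_probability(period: int, sample_size: int,
                     error_len: int, generations: int) -> float:
    """Chance that no sample overlaps a persistent corrupt extent.

    A sample of length sample_size overlaps an extent of length error_len
    when its starting offset falls in a window of
    sample_size + error_len - 1 positions. With Start uniform over one
    Period, the per-generation hit probability is that window divided by
    the Period, capped at 1.
    """
    hit = min(1.0, (sample_size + error_len - 1) / period)
    # Independent Start values multiply the per-generation miss chances.
    return (1.0 - hit) ** generations
```

Under these assumptions the miss probability shrinks geometrically with the number of generations, and a smaller Period lowers it for every generation at the cost of reading more data.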

Copy Accuracy Assurance

FIG. 20 illustrates file set 2004 created and used by application 2002, such as Microsoft SQL Server, Microsoft Exchange Server, etc., to save application data. A backup application copies the file set as a backup data set 2008 to backup storage 2010 during backup operation 2012.

Applications such as Microsoft SQL Server, Microsoft Exchange Server, etc. store application data in a set of files on primary storage 2016 (e.g., production storage). The format of each file, the number of files used by each application and the content of each file differ for each application. The backup application needs to copy the files created and used by the application to a file system on backup storage 2010.

To reduce the time required for backup operation and to reduce the storage consumed by backup data, backup application typically performs incremental backup. During incremental backup, only the changed blocks in each changed file are copied to backup storage. Incremental backup of applications is described in detail herein and in U.S. Provisional Application No. 61/905,346, filed on Nov. 18, 2013, entitled “Computerized Methods and Apparatus for Incremental Database Backup Using Change Tracking,” the disclosure of which is herein incorporated in its entirety.

FIG. 21 illustrates incremental copy of each file performed by backup application during backup process. For each application data file 2102, backup application identifies changed blocks 2106 within the file since the previous backup. It then copies 2108, 2110, 2112, 2114, 2116 the changed blocks 2106 to the copy of the file on backup storage 2104. Note that changed blocks within a file can appear anywhere in the file and can differ in size.

If backup application for some reason does not identify changed blocks in a file correctly, or fails to copy changed blocks to the copy of the file on backup storage correctly, the resultant backup image can be corrupt due to incomplete copy. Such a corrupt backup may not be restorable. Incomplete copies are described in detail above.

The Copy Accuracy Assurance mechanism as described in some embodiments of the present disclosure helps prevent corrupt backups due to incorrect change block tracking information or failure to copy changed blocks.

FIG. 22 illustrates one implementation of Copy Accuracy Assurance mechanism. In this implementation, random 64 KB blocks are read from frozen copy of file from Production Storage 2202 and backup copy of file on Backup Storage 2204 and compared 2214, 2216, 2218, 2220, 2222 to verify accuracy of copy. In some embodiments, these blocks are spaced uniformly throughout the length of the file. Starting offset 2206, 2210 points to the beginning of the first random block selected for verification. Period 2208, 2212 is the spacing between blocks selected for verification. If any of the blocks between source and target are different, the two files are deemed different. Uniform spacing is just one of the schemes that can be used for selecting blocks for verification. Other schemes may be chosen depending on the application and content of files.

In this implementation, the data fingerprint for a file is not stored by backup application. Instead, a random sample is generated, compared and thrown away each time Copy Accuracy verification is invoked.

Each fingerprint verification operation can use a different set of random blocks for comparing source and target files. Each successful verification increases confidence in accuracy of copy.

Data fingerprinting is performed even if backup application detects no changes to application data files. This scheme increases confidence that the source and backup files are identical.

There are situations where backup application might want to compute a fingerprint from the random sample of blocks, instead of just sparsely comparing a random sample of blocks, and save the computed fingerprint for future use as part of backup metadata. Saving the computed fingerprint for a file allows backup application to verify the contents of backed up files when the copy is accessed for any purpose.

FIG. 23 is a flowchart illustrating a fingerprint verification process. During fingerprint verification, source file on Production Storage and target file on backup storage are compared to verify accuracy of copy. Backup application opens the source file at step 2302 and opens target file at step 2304. It then selects the random start offset for verification at step 2306. At step 2308, backup application selects the spacing between two consecutive blocks to read for verification. The number of blocks selected for verification depends on the size of file being verified. Backup application then determines if there are more blocks to compare in the file at step 2310. If there are more blocks to compare, backup application then reads a block from source file at step 2312 and reads the corresponding block in target file at step 2314. At step 2316, backup application compares the source and target blocks. If the blocks are different, the fingerprint verification is considered as failed and further verification is stopped. If the compared blocks are the same then steps 2312 through 2318 are repeated until all selected blocks are compared. If all blocks selected for random verification are the same, the files are considered the same.
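The verification loop of FIG. 23 can be sketched as below. This is an illustrative sketch under stated assumptions, not the disclosed implementation: the function name, the `num_blocks` parameter used to derive the spacing, and the treatment of a size mismatch as a failed verification are all hypothetical choices.

```python
import os
import random

BLOCK_SIZE = 64 * 1024  # block length used in the FIG. 22 illustration


def verify_copy(source_path: str, target_path: str,
                num_blocks: int = 16, rng=None) -> bool:
    """Sparsely compare uniformly spaced blocks of two files."""
    rng = rng or random.Random()
    size = os.path.getsize(source_path)
    if size != os.path.getsize(target_path):
        return False  # assumption: differing sizes fail verification
    if size == 0:
        return True
    # Step 2308: spacing between consecutive blocks (depends on file size).
    period = max(size // num_blocks, BLOCK_SIZE)
    # Step 2306: a fresh random start offset for each verification run.
    start = rng.randrange(min(period, size))
    with open(source_path, "rb") as src, open(target_path, "rb") as dst:
        for offset in range(start, size, period):      # step 2310
            src.seek(offset)
            dst.seek(offset)
            source_block = src.read(BLOCK_SIZE)        # step 2312
            target_block = dst.read(BLOCK_SIZE)        # step 2314
            if source_block != target_block:           # step 2316
                return False  # stop at the first mismatch
    return True
```

Nothing is stored between runs in this sketch, which matches the variant in which the sample is generated, compared and discarded on each invocation.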

A different start offset is selected at step 2306 for each fingerprint verification operation. This increases the confidence in accuracy of copy.

Incremental Backup Using Change Tracking

Incremental backup of a database generally involves backing up a full copy of the database and then backing up just the changes to the database since the last full or incremental backup. Incremental backup can reduce the amount of data that needs to be backed up during each backup operation, the storage space consumed by the backup image on the backup device and/or the time required to back up the database.

FIG. 24 depicts a traditional incremental backup solution for databases. Database Application 2401 consumes primary storage 2402 for saving database files. Periodic database backup 2404 is performed by a backup application to a backup device 2403 that is different from the primary storage 2402. The backup device 2403 is typically external to the database server, such as an external disk or tape device.

Backup applications typically perform a full backup of a database followed by a series of incremental backups, each backup occurring at a distinct backup time 2406. The layout 2405 of backup image 2407 on backup device 2403 consists of a full copy of the database for the full backup 2408. Each subsequent incremental backup is stored as a separate backup image 2409, 2410, 2411, 2412, 2413 in an incremental backup format specific to the application that is being backed up on the backup device. The incremental backup format is usually different from a native file format for the database, so the incremental backup cannot be used interchangeably with database files.

This traditional approach to incremental backup of databases can have certain disadvantages. Restoring a database to a point in time state requires first restoring the last full backup and then applying subsequent incremental backups until the database is rolled forward to a desired point in time. This can increase the time required to restore a database. The larger time required for restore may mean a larger downtime for a business in the case of a disaster.

Another potential disadvantage is that restore using the backup taken with a traditional approach often requires that all incremental backups between the full backup and desired restore point are available at the time of restore. If any of the incremental backups are missing, the database cannot be restored.

Another potential disadvantage is that, to reduce the time required to restore the database and to avoid the need to retain the first full backup for an indefinite period of time, full backup of database often needs to be performed periodically. Full database backup is slow and consumes the same

The techniques described herein provide for monitoring changes made to a file using a change tracking driver. An incremental backup can be generated (e.g., for a database) using the change tracking driver by first copying all data for the first backup of the data, and then using the change tracking driver to only copy changed data since the first backup. Each incremental backup can be a stand-alone backup such that it includes a reference to the original data in the first backup, as well as references to all changes since the first backup for the respective incremental backup. Further, the data (and changes) can be copied in native form to provide for efficient restoration.

In the present disclosure, an approach is presented for incremental backup of databases (e.g., on servers running a Windows operating system) using a change tracking driver. This approach can overcome many of the disadvantages of a traditional incremental backup approach, as will be appreciated more fully herein.

FIG. 25 is an exemplary diagram illustrating an incremental backup using a change tracking driver, according to some embodiments. FIG. 25 includes database application 2501 consuming primary storage 2502 for saving database files. Backup application backs up 2504 database application 2501 to backup device 2503. Change tracking driver 2514 is installed on the server that is running the database application 2501. Change tracking driver 2514 monitors changes to database files and records changes to those database files in a change-tracking bitmap (not shown, but discussed further herein).

In this exemplary approach shown in FIG. 25, a first backup (e.g., backup 2508) for each database file involves a full copy of the file to the backup device at a first backup time 2506. Subsequently, the changes to each database file, at subsequent backup times 2506, are captured in a native database format using the change-tracking driver 2514. During each incremental backup operation (e.g., incremental backups 2509, 2510, 2511, 2512, and 2513), the changed blocks within each database file are copied to the copy of the file made during the previous backups on the backup device. Once changes to all files are copied to the backup device, a copy-on-write snapshot of the backup device is created to capture the point-in-time state of each database file. Each snapshot of the backup device results in a full independent backup of the database 2509, 2510, 2511, 2512, and 2513 (e.g., because each snapshot includes references or pointers to both the original data and all changed data).

In this exemplary approach, only the changes made to the database files since the last backup are copied to the backup device during each incremental backup. However, each resultant backup image 2507 is a full point-in-time copy of the database in application native format as shown in backup layout 2505. Backup images 2508, 2509, 2510, 2511, 2512, and 2513 are all full copies of the database in the native format of the application that can be restored instantaneously. In some embodiments, the backups after the first initial backup (e.g., 2509-2513) reference the first backup (e.g., backup 2508) rather than copying all of the data for each incremental backup.

Each backup image on the backup device is a full, independent copy of the database in native database format and can be deleted without affecting any other backup images. For example, if backup 2510 is deleted, application 2501 can still be restored to any of backups 2511, 2512, or 2513. For example, backup 2511 includes a pointer to the original full backup data 2508, and also includes all of the changed data that was modified since backup 2508 (e.g., the changed data that is included in both backup 2509 and backup 2510, as well as any additional data that changed since backup 2510). Therefore, a restore operation simply loads the original data that is pointed to by the backup 2508, and merges in the changed data stored in backup 2511.
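The independence of the backup images can be illustrated with a small in-memory model. It is a sketch under simplifying assumptions and not the disclosed implementation: blocks are held in Python dictionaries keyed by block index, the `Snapshot` class and helper names are hypothetical, and a real copy-on-write snapshot would share unchanged blocks on the backup device rather than copy dictionary entries.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """One backup image: a reference to the first full backup plus every
    block that has changed since that first backup."""
    base: dict
    changed: dict = field(default_factory=dict)


def take_snapshot(base: dict, previous, new_changes: dict) -> Snapshot:
    """Carry forward all earlier changes, then apply the newest ones."""
    merged = dict(previous.changed) if previous is not None else {}
    merged.update(new_changes)
    return Snapshot(base=base, changed=merged)


def restore(snapshot: Snapshot) -> dict:
    """Load the original data and merge in the snapshot's changed blocks."""
    image = dict(snapshot.base)
    image.update(snapshot.changed)
    return image


# Usage: the reference numerals mirror FIG. 25 for readability only.
full_2508 = {0: b"A", 1: b"B", 2: b"C"}
backup_2509 = take_snapshot(full_2508, None, {1: b"b1"})
backup_2510 = take_snapshot(full_2508, backup_2509, {2: b"c2"})
backup_2511 = take_snapshot(full_2508, backup_2510, {0: b"a3"})
del backup_2510  # deleting one image does not affect later images
restored_2511 = restore(backup_2511)
```

Because each snapshot carries its own cumulative set of changed blocks, restoring needs only that snapshot and the first backup, never the intermediate images.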

FIGS. 26A and 26B are exemplary flow charts illustrating a computerized method for incremental backup using a change tracking driver, according to some embodiments. Referring to FIG. 26A, a backup operation for a database starts at step 2610. The backup application identifies all data files for the database being backed up at step 2612. For each data file 2614, the backup application checks at step 2616 if the data file can be monitored for change tracking, and if so, the backup application starts a change tracking bitmap for the file at step 2618. The change tracking bitmap started in this step will record changes to the database file starting from its creation time and will be used for the next backup. Referring to step 2616, criteria for change tracking can be pre-set for each file or file type. For example, a flag can be set for a file to indicate whether the file is eligible for change tracking based on file size and/or the type of file system. For example, it may not be worth tracking changes for a small file. As another example, it may only be desirable to back up a database for a particular type of file system (e.g., Microsoft Windows).

The backup application creates a VSS snapshot of all volumes that contain data files for the database at step 2620. U.S. patent application Ser. No. 13/920,976, entitled “System and Method for Providing Intra-Process Communication for an Application Programming Interface” addresses an example of a VSS snapshot process, which is hereby incorporated by reference herein in its entirety.

Referring to FIG. 26B, the method proceeds to step 2622 where, for each data file, the backup application checks if there is a change-tracking bitmap available that was created during the previous backup at step 2624. If the change-tracking bitmap is available and the bitmap is reliable (e.g., all the changes to the monitored file are recorded in the bitmap on disk successfully, monitoring process didn't terminate abnormally, etc.), the backup application retrieves the change-tracking bitmap for the file and copies each changed block from the database file from VSS snapshot to the backup device at step 2626. Once the changed blocks are copied for a data file to the backup device, the backup application deletes the change tracking bitmap used for this backup at step 2628.

Referring back to step 2624, if a change-tracking bitmap is not available for a file (e.g., if the file has not yet been backed up) or the contents of change-tracking bitmap are unreliable (e.g., all the changes to the monitored file are not recorded in the bitmap on disk successfully, monitoring process terminated abnormally, etc.), then the entire file is copied to the backup device at step 2630. Once all files are copied to backup device, the backup application creates a snapshot of the backup device 2632 to preserve the point-in-time state of the backup device. The backup operation completes at step 2634 upon successful creation of snapshot.
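The per-file decision of FIG. 26B can be sketched with an in-memory model. This is an illustration under stated assumptions rather than the disclosed implementation: files are lists of blocks, the `Bitmap` class and `run_backup` function are hypothetical names, a deep copy stands in for the snapshot of the backup device, an unreliable bitmap is simply discarded, and a file whose block count changed is assumed to fall back to a full copy.

```python
import copy
from dataclasses import dataclass


@dataclass
class Bitmap:
    changed_blocks: set    # indexes of blocks modified since the last backup
    reliable: bool = True  # result of the reliability checks on the bitmap


def run_backup(vss_snapshot: dict, backup_device: dict, bitmaps: dict) -> dict:
    """Copy changed blocks (or whole files) and snapshot the backup device."""
    for name, source_blocks in vss_snapshot.items():          # step 2622
        # Steps 2624/2628: fetch the previous bitmap and drop it after use.
        bitmap = bitmaps.pop(name, None)
        target = backup_device.setdefault(name, [])
        usable = (bitmap is not None and bitmap.reliable
                  and len(target) == len(source_blocks))
        if usable:
            for index in bitmap.changed_blocks:               # step 2626
                target[index] = source_blocks[index]
        else:
            target[:] = source_blocks                         # step 2630
    # Step 2632: preserve the point-in-time state of the backup device.
    return copy.deepcopy(backup_device)


# Usage: first backup (no bitmap), then one incremental backup.
device = {}
snap1 = run_backup({"db.mdf": [b"a", b"b", b"c"]}, device, {})
pending = {"db.mdf": Bitmap(changed_blocks={1})}
snap2 = run_backup({"db.mdf": [b"a", b"X", b"c"]}, device, pending)
```

The first run copies the whole file because no bitmap exists yet; the second run copies only block 1, yet `snap2` is still a complete image of the file.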

FIG. 27 is an exemplary table illustrating the lifecycle of a change tracking bitmap, according to some embodiments. The table describes the lifecycle of change tracking bitmap(s) for each database file. During the first backup at backup time D1, a new bitmap with id 1 is created for a database file. This bitmap is used for tracking changes to the database file being monitored from the time of creation of the bitmap. During this backup, only one bitmap is in existence for the file. Since the bitmap was just created, it will not be used for incremental backup since there are not yet any changes to the file. Instead, the contents of the entire database file will be copied to the backup device.

At backup time D2, a second change tracking bitmap with bitmap id 2 will be created for the database file. During the backup operation, there are now two change tracking bitmaps in existence, one started at backup time D1 and the other started at backup time D2. Bitmap with id 1 will contain the record of changes made to the database file since the last backup made at time D1 until time D2, and the bitmap with id 2 is empty since there have not yet been any changes to the database file since D2. The bitmap with id 1 will be used for an incremental backup at backup time D2. Once the incremental backup is successful, the bitmap with id 1 will be deleted since the bitmap with id 2 is being used to track subsequent changes to the database file.

Each subsequent backup at times D3, D4, D5 and D6 will create a new change tracking bitmap with ids 3, 4, 5 and 6, respectively, for tracking changes to the database between the respective backup and the next backup. The new bitmap created at the beginning of the last backup will be used for incremental backup of the database file during the following backup operation.
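The bitmap lifecycle of FIG. 27 can be traced with a few lines of code. The function below is a hypothetical illustration, not the disclosed implementation; it tracks only bitmap ids and assumes every backup succeeds.

```python
def simulate_lifecycle(backup_times):
    """Return (log, live_ids) for a sequence of successful backups.

    Each log entry is (backup_time, backup_kind, bitmap_id_used).
    """
    live = []      # ids of bitmaps currently in existence for the file
    log = []
    next_id = 1
    for when in backup_times:
        live.append(next_id)  # a new bitmap starts with each backup
        next_id += 1
        if len(live) == 1:
            # First backup: the only bitmap was just created, so the
            # entire file is copied.
            log.append((when, "full", None))
        else:
            # Later backups: use the older bitmap, then delete it.
            used = live.pop(0)
            log.append((when, "incremental", used))
    return log, live


log, live = simulate_lifecycle(["D1", "D2", "D3", "D4", "D5", "D6"])
```

After the six backups only the bitmap created at D6 remains, ready to drive the following backup.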

FIG. 28 is an exemplary diagram illustrating a change tracking driver deployment, according to some embodiments. As an illustrative non-limiting example, assume that the change tracking kernel mode driver is installed on a server 2801 with a Windows operating system (one of skill in the art can appreciate that any type of computer and/or operating system can be used without departing from the spirit of the techniques described herein). The change-tracking driver is composed of two components, a Kernel Mode (KM) Windows mini-filter driver 2807 located in kernel mode 2802 and a User Mode (UM) service 2806 located in user mode 2803.

KM driver 2807 interacts with the filter manager 2822 in the Windows I/O stack 2805, which includes the I/O manager 2820, the filter manager 2822, the File System driver 2824, and the storage driver 2826. Anytime database application 2809 modifies a file, the filter manager 2822 intercepts the I/O request and sends it to KM driver 2807. The KM driver 2807 checks if the file being modified needs to be monitored and notifies UM Service 2806, across the UM/KM Boundary 2804, if the file is being monitored. UM service 2806 is responsible for serving requests from backup application 2808 and manipulating change-tracking bitmaps in response to notifications from KM driver 2807. Backup application 2808 is responsible for performing actions necessary for backing up a desired database.

UM service 2806 records changes made to a monitored database file in a change-tracking bitmap. Each change-tracking bitmap is saved on the disk at a location chosen by backup application 2808. The on-disk copy of the bitmap is memory-mapped into UM service process 2806 for recording changes to the file. The modified bitmap is saved on the disk once the changes are recorded in the bitmap.
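Recording a write in a change-tracking bitmap amounts to setting one bit per tracking block that the write touches. The sketch below is an assumption-laden illustration rather than the disclosed implementation: the `ChangeBitmap` class is hypothetical, the bitmap lives in an in-memory `bytearray` instead of a memory-mapped file, and the bitmap is assumed to grow when a write extends past the original file size.

```python
class ChangeBitmap:
    """One bit per fixed-size block of the monitored file."""

    def __init__(self, file_size: int, block_size: int):
        self.block_size = block_size
        block_count = (file_size + block_size - 1) // block_size
        self.bits = bytearray((block_count + 7) // 8)

    def record_write(self, offset: int, length: int) -> None:
        """Mark every block overlapped by the byte range as changed."""
        if length <= 0:
            return
        first = offset // self.block_size
        last = (offset + length - 1) // self.block_size
        needed = last // 8 + 1
        if needed > len(self.bits):
            # Assumption: grow the bitmap when the file is extended.
            self.bits.extend(bytes(needed - len(self.bits)))
        for block in range(first, last + 1):
            self.bits[block // 8] |= 1 << (block % 8)

    def changed_blocks(self) -> list:
        """Indexes of all blocks whose bit is set, in ascending order."""
        return [block for block in range(len(self.bits) * 8)
                if self.bits[block // 8] & (1 << (block % 8))]


bitmap = ChangeBitmap(file_size=1024 * 1024, block_size=64 * 1024)
bitmap.record_write(offset=65535, length=2)           # straddles blocks 0 and 1
bitmap.record_write(offset=10 * 65536, length=65536)  # exactly block 10
```

A backup application would later walk `changed_blocks()` and copy only those blocks from the snapshot of the source file.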

In some embodiments, all communication between KM driver 2807 and UM service 2806 is asynchronous to avoid, for example, database I/O performance degradation.

FIG. 29 is an exemplary diagram illustrating a change tracking bitmap data structure 2900, according to some embodiments. The bitmap data structure 2900 consists of a header 2902 that describes the bitmap and its state. Immediately following the header 2902 is a bitmap 2904 that includes a record of changes made to the database file being monitored using this bitmap 2900.

The header 2902 can include various fields. The header 2902 in this example includes the following fields: MagicNumber 2910, HeaderSize 2912, Version 2914, VolumeGuid 2916, FileId 2918, LastTimeOpened 2920, LastTimeClosed 2922, LastTimeUpdated 2924, BlockSize 2926, ClosedClean 2928 and UntrackedChanges 2930. Field MagicNumber 2910 is a unique identifier used to indicate that a bitmap is created by this change tracking driver. Field HeaderSize 2912 indicates the size of the bitmap header. Field Version 2914 indicates the version of the bitmap format. Field VolumeGuid 2916 indicates the file system volume the monitored file resides on. Field FileId 2918 is the file identifier for the monitored file. Field LastTimeOpened 2920 indicates the last time the bitmap file on disk was opened for reading or modification. Field LastTimeClosed 2922 indicates the last time the bitmap file on disk was closed. Field LastTimeUpdated 2924 indicates the last time the bitmap was updated. Field BlockSize 2926 indicates the grain size used for tracking changes to the monitored file.

There are two fields in this example, ClosedClean 2928 and UntrackedChanges 2930, in bitmap header 2902 that help determine the reliability of each bitmap. Field ClosedClean 2928 indicates that the bitmap was saved on the disk successfully after recording file changes in the bitmap. This field is set to FALSE when a bitmap is opened for modification. The field is set to TRUE when the bitmap is saved to the disk. The backup application can discard a bitmap as unreliable if this field is set to FALSE when the bitmap is retrieved for a backup operation.

Field UntrackedChanges 2930 indicates whether there were changes made to the file being monitored that were not recorded in the change-tracking bitmap. This can happen for a number of reasons, including a UM service crash, malfunctioning of software, etc. The backup application can discard a bitmap as unreliable if this field is set to TRUE when the bitmap is retrieved for a backup operation.
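The reliability test described for ClosedClean 2928 and UntrackedChanges 2930 can be illustrated with a minimal Python sketch. The field names mirror the header fields listed above, but the Python types, the `EXPECTED_MAGIC` value, the magic-number check, and the function name `is_reliable` are assumptions made for illustration and are not part of the disclosure.

```python
from dataclasses import dataclass

# Hypothetical magic value; the disclosure does not specify one.
EXPECTED_MAGIC = 0x43544248


@dataclass
class BitmapHeader:
    """Illustrative in-memory view of header 2902 (types are assumed)."""
    magic_number: int
    header_size: int
    version: int
    volume_guid: str
    file_id: int
    last_time_opened: float
    last_time_closed: float
    last_time_updated: float
    block_size: int
    closed_clean: bool
    untracked_changes: bool


def is_reliable(header: BitmapHeader, expected_magic: int = EXPECTED_MAGIC) -> bool:
    """Return True if a backup application could trust this bitmap."""
    # Assumed extra check: the bitmap was written by this change tracking driver.
    if header.magic_number != expected_magic:
        return False
    # Discard if the bitmap was not saved cleanly or if changes went unrecorded.
    return header.closed_clean and not header.untracked_changes
```

Under these assumptions, a bitmap that fails the check would be treated as if no bitmap existed.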

FIG. 30 is an exemplary flow chart illustrating a computerized method for starting change tracking for a file, according to some embodiments. UM driver service starts at step 3010. It loads KM driver at startup at step 3012 and waits for client requests and notifications from KM driver at step 3014. KM driver is loaded at step 3013, and at step 3016 registers callbacks with Filter Manager so that it gets notified when certain file operations such as file creation, deletion, modification, etc. occur. KM driver then waits to receive notifications from either the UM service or the Filter Manager at step 3028.

When a request to start a new change-tracking bitmap for a file is received from backup application at step 3018, UM service computes the size of the new bitmap using the size of the file being tracked and the block size backup application requested for change tracking at step 3020. It then creates a new bitmap and saves it to a file on disk at step 3022. If this is the first bitmap started for the file, UM service updates the list of tracked files at step 3024 and notifies the KM driver that a new file tracking has been started at step 3026. UM service also begins tracking a response 3032. At step 3030, in response to the notification, KM driver associates a context with the file being tracked to indicate file-tracking status. KM driver then relies on the associated context to determine if file change notifications for the file should be sent to the UM service. Tracked file information can be stored in a file on a disk or in the Windows registry.
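The size computation at step 3020 can be sketched as follows. This is a minimal Python illustration that assumes one bit per block and rounds up to whole bytes; the function name `bitmap_size_bytes` is hypothetical, and the size of header 2902 is not included.

```python
def bitmap_size_bytes(file_size: int, block_size: int) -> int:
    """Bytes needed for a bitmap with one bit per block (assumed layout)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # Round up so that a partial trailing block still gets a bit.
    blocks = (file_size + block_size - 1) // block_size
    # Round up to a whole number of bytes.
    return (blocks + 7) // 8
```

For example, under these assumptions a 1 MiB file tracked at a 4096-byte block size has 256 blocks and needs a 32-byte bitmap.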

FIG. 31 is an exemplary flow chart illustrating a computerized method for terminating change tracking for a file, according to some embodiments. This operation can be performed, for example, when a database no longer needs to be backed up or when a backup application wants to discard all change tracking bitmaps for a file. When a stop-tracking request for a file is received from a backup application at step 3120, UM service first removes all in-memory bitmaps for the file at step 3122. It then deletes all on-disk bitmaps for the file at step 3124. UM service removes the file from the list of tracked files at step 3126 and notifies KM driver, which waits for notifications from the UM Service 3028, that the file tracking has been stopped at step 3128. In response to the notification from UM service, KM driver deletes the context previously associated with the tracked file at step 3132 and UM Mode Driver Service stops monitoring the file 3134.

FIG. 32 is an exemplary flow chart illustrating file modification notifications from the system, according to some embodiments. Filter Manager notifies KM driver when a file gets modified at step 3222. KM driver checks the context associated with the file and notifies UM Service 3224 of the change if the file is monitored at step 3226. In response to the file change notification from KM driver, UM service loads all change-tracking bitmaps for the file in memory at step 3228. UM service computes the bits to set in each bitmap using the offset and length received with the file change notification and the block size used for monitoring the file. UM service then sets the bits in the bitmap at step 3230 and saves the modified bitmaps to disk at step 3232. This completes the processing of the file change notification, and UM service waits for additional notifications from KM driver.
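The bit computation performed before step 3230 can be sketched in Python as follows. The sketch assumes one bit per block with the least significant bit of each byte representing the lowest-numbered block; that bit order and the function name `mark_changed` are assumptions, not details given in the disclosure.

```python
def mark_changed(bitmap: bytearray, offset: int, length: int, block_size: int) -> None:
    """Set the bit for every block touched by a write of `length` bytes at `offset`."""
    if length <= 0:
        return
    first_block = offset // block_size
    # The last byte written is at offset + length - 1.
    last_block = (offset + length - 1) // block_size
    for block in range(first_block, last_block + 1):
        # Assumed bit order: LSB of each byte is the lowest-numbered block.
        bitmap[block // 8] |= 1 << (block % 8)
```

A write that straddles block boundaries marks every block it touches, so an 8192-byte write starting 10 bytes into block 3 (4096-byte blocks) marks blocks 3, 4, and 5.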

FIG. 33 is an exemplary flow chart illustrating a computerized method for deleting a change tracking bitmap, according to some embodiments. UM service receives a request to delete a change-tracking bitmap from backup application at step 3322. In response to the request, UM service removes the bitmap from memory and deletes the bitmap on disk at step 3324. UM service then sends a response to backup application at step 3326 and starts waiting for new requests.

This method can also be used for backing up Virtual Machines hosted by a hypervisor such as a Microsoft Hyper-V Server by installing the backup application and change tracking driver on the hypervisor. In some embodiments of this configuration, the backup application and the change tracking driver are installed on the hypervisor and not inside any of the hosted Virtual Machines.

Storage allocated to Virtual Machines takes the form of Virtual Hard Disk files (for example VHD, VHDx, AVHDx files for Hyper-V Server) hosted on the native file system on the hypervisor. These files can be backed up in their native format using the change tracking driver. These Virtual Hard Disk files can be presented to any hypervisor to access and retrieve data that was backed up previously.

In addition to the Virtual Hard Disk files, configuration files for each VM can be backed up. This allows for reconstructing an exact clone of a Virtual Machine when access to the previously backed up state of the Virtual Machine is needed.

Backup of Virtual Machines in native format can allow for near instant restore and/or cloning of a previously backed up VM by presenting a copy of previously backed up Virtual Disk Files to any hypervisor. This can reduce impact on business in case of disasters and provide for business continuity.

FIG. 34 is an exemplary diagram illustrating a change tracking driver deployment on Hyper-V Server, according to some embodiments. As an illustrative non-limiting example, assume that the change tracking kernel mode driver is installed on Hyper-V Server 2801 (one of skill in the art can appreciate that any type of computer and/or operating system can be used without departing from the spirit of the techniques described herein). The change tracking driver is composed of two components, a Kernel Mode (KM) Windows mini-filter driver 2807 and a User Mode (UM) service 2806.

KM driver 2807 interacts with filter manager 2822 in the Windows I/O stack 2805, which includes the I/O Manager 2820, the filter manager 2822, the File System driver 2824, and the storage driver 2826. Anytime any data is written to one of the disks in Virtual Machines 3418, 3420, a Virtual Hard Disk file gets modified. The filter manager in the hypervisor I/O stack intercepts the modification request and sends it to KM driver 2807. The KM driver checks if the Virtual Disk file needs to be monitored and notifies UM service 2806 if needed. In response to the I/O notification from KM driver 2807, UM service 2806 updates the change tracking bitmap for the file. Backup application 2808 is responsible for performing actions necessary to back up the desired Virtual Machine.

In some embodiments, communication between KM driver 2807 and UM service 2806 is performed asynchronously to avoid, for example, I/O performance degradation for the Virtual Machine.

Some embodiments of the present disclosure describe a computerized method of creating an incremental backup of application data by creating a snapshot associated with a current incremental backup of a data file using a change tracking bitmap such that a data file associated with the current incremental backup can be restored from just the snapshot associated with the current incremental backup and an initial backup without needing to access one or more previously generated incremental backups of the data file, each created at an earlier point in time than the point in time for the current incremental backup, the method comprising: receiving, by a computing device, a data file to be monitored by the computing device; identifying, by the computing device, a prior change tracking bitmap associated with the data file, the prior change tracking bitmap comprising data indicative of changes made since a backup created at an earlier point in time than the point in time for the current incremental backup; determining, by the computing device, blocks of data of the data file changed since the prior change tracking bitmap for the prior incremental backup; transmitting, by the computing device, to a backup device blocks of data of the data file changed since the prior change tracking bitmap for the prior incremental backup; and creating, by the computing device, a copy-on-write snapshot of the backup device to capture a point-in-time state of the data file, such that the data file associated with the current incremental backup can be restored from just the snapshot associated with the current incremental backup and the initial backup without needing to access one or more previously generated incremental backups of the data file, each created at an earlier point in time than the point in time for the current incremental backup.

In some embodiments, the backup device includes data indicative of all changes made for each of a set of backups created at an earlier point in time other than the point in time for the current incremental backup. The method can further include transmitting instructions, from a computing device, to a backup application to create a current change tracking bitmap associated with the current incremental backup for tracking changes to the data file after the current incremental backup. The method can further include deleting, by the computing device, the prior change tracking bitmap after creating the current change tracking bitmap. In some embodiments, if the change tracking bitmap does not exist, the method can further include transmitting instructions to the backup application to copy the entire data file to create an initial backup of the data file and to create an initial change tracking bitmap for tracking changes made to the data file after generation of the initial backup. In some embodiments, if the data file has a prior change tracking bitmap, the method can further include determining if the prior change tracking bitmap is reliable. Receiving, by a change tracking driver, a data file to be monitored can further include determining if the data file is eligible for change tracking. In some embodiments, the data file comprises at least one of a database file and a virtual file. In some embodiments, the virtual file comprises at least one of a configuration file and a virtual hard disk file for a virtual machine, facilitating near instant restore and cloning of previously backed up virtual machines. In some embodiments, the backup created at an earlier point in time comprises a backup created most recent in time to the current incremental backup.
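The overall method can be illustrated with a simplified Python sketch. It models the data file and the backup device as in-memory byte buffers of fixed size, uses a full `bytes` copy as a stand-in for a copy-on-write snapshot, and treats a missing (or discarded) bitmap as the trigger for a full copy. The function name `incremental_backup`, the bit order, and these simplifications are assumptions for illustration only.

```python
def incremental_backup(data_file: bytes, backup_device: bytearray,
                       prior_bitmap, block_size: int):
    """Update backup_device from data_file and return (snapshot, new_bitmap)."""
    if prior_bitmap is None:
        # No usable bitmap: copy the entire file to create the initial backup.
        backup_device[:] = data_file
    else:
        # Transmit only the blocks whose bit is set in the prior bitmap.
        for block in range(len(prior_bitmap) * 8):
            if prior_bitmap[block // 8] & (1 << (block % 8)):
                start = block * block_size
                backup_device[start:start + block_size] = \
                    data_file[start:start + block_size]
    # Stand-in for a copy-on-write snapshot of the backup device: each
    # snapshot is a complete point-in-time image and can be restored alone.
    snapshot = bytes(backup_device)
    # Start a fresh, empty bitmap to track changes after this backup.
    blocks = (len(data_file) + block_size - 1) // block_size
    new_bitmap = bytearray((blocks + 7) // 8)
    return snapshot, new_bitmap
```

Because the backup device always holds the full current image before the snapshot is taken, each snapshot in this sketch is restorable without replaying earlier incremental backups.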

Data Cloning

Backup images of an application are often created based on a pre-defined service level agreement (“SLA”) that defines the frequency of the backup (e.g., daily, weekly, monthly, etc.) and other parameters, such as the application source, the backup target, etc. Over time, a backup SLA often results in multiple backup images being created for the application. Differences between backup images are often captured as a set of bitmaps. It is often desirable to instead generate a live copy of the application, such as for testing and development purposes. It is further often desirable to remove sensitive data (e.g., confidential information, such as social security numbers, account numbers, passwords, etc.) from the live copy of the application before using the live copy for testing and development.

The disclosed techniques enable creating space-efficient, policy-independent copies of backup images that can be leveraged to provide a mechanism for rapid test-and-development capabilities, referred to herein as live clone images. The live clone is an exact copy of a backup image. It is a “live” copy of production data such that data as it is being stored can be mounted/executed without needing to change the data format, compared to storing deduplicated data or snapshots, which cannot be mounted/executed as stored. A synthesized bitmap can be created based on bitmaps associated with subsequent backups since the creation of the live clone. The synthesized bitmap can be used to refresh the live clone by only copying changed data indicated in the synthesized bitmap. The live clone image can be prep-mounted for a scrub operation (e.g., to remove sensitive information before testing or development), which includes generating both a copy of the live clone and a bitmap indicative of the data scrubbed during the scrub operation. If the scrub operation is approved, then the bitmap and copy can be discarded. If the scrub operation is not approved, the bitmap can be used to copy only changed data from the copy of the live clone.

FIG. 35 is an exemplary diagram illustrating the creation process of a live clone image from a backup image of application 3501, according to some embodiments. Backup images 3502 of application 3501 are created based on a pre-defined SLA. Over time this will result in multiple backup images 3503, 3504, 3505, 3506 being created for the application 3501. Differences between backup images are captured as a set of bitmaps 3507, 3508, 3509 for backup images 3504, 3505, and 3506, respectively.

The live clone is created, as indicated via arrow 3510, by copying data blocks from the source backup image 3503 into a new live clone image 3511. At the end of the live clone creation process the live clone image 3511 is an exact copy of the source backup image 3503.

FIG. 36A is an exemplary diagram illustrating the refresh process for live clone image 3611 from a previously created backup image 3606, according to some embodiments. As described above for FIG. 35, backup images 3602 of application 3601 are created based on a pre-defined SLA, resulting in backup images 3603, 3604, 3605, and 3606 and associated bitmaps 3607, 3608, and 3609. The live clone 3611 can be refreshed by creating a synthesis of the bitmaps for all prior backup images (e.g., bitmaps 3607, 3608, and 3609), as indicated via arrow 3610. The synthesized bitmap 3610 can be used to only copy changed data blocks into the existing live clone image 3611.

FIG. 36B is an exemplary diagram illustrating a computerized method for the refresh process shown in FIG. 36A, according to some embodiments. At step 3650, the backup bitmaps (e.g., flash-copy bitmaps maintained by the hardware) generated between the most recent backup image used to create or refresh the live clone image (e.g., backup image 3603) and the backup image to which the live clone is to be refreshed (e.g., backup image 3606) are identified. At step 3652, a synthesized bitmap is created based on the bitmaps identified in step 3650 (e.g., bitmaps 3607, 3608, and 3609). At step 3654, the changed data blocks are copied based on the synthesized bitmap. At the end of the refresh operation the live clone image 3611 is an exact copy of the source backup image 3606. For example, U.S. patent application Ser. No. 13/920,981, entitled “Smart Copy Incremental Backup,” describes an example of bitmaps or extents that can be used with the techniques described herein, which is hereby incorporated by reference herein in its entirety.

Referring to step 3654, once the changed blocks are identified using the bitmap, the corresponding blocks are copied from the disks belonging to the source image to the disks belonging to the destination volume.
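Steps 3652 and 3654 can be sketched in Python as follows. The sketch assumes that synthesis is a bitwise OR of equal-length bitmaps and models disks as in-memory byte buffers; the function names and the bit order are hypothetical and chosen only for illustration.

```python
def synthesize(bitmaps):
    """Combine equal-length bitmaps with bitwise OR (assumed synthesis rule)."""
    result = bytearray(len(bitmaps[0]))
    for bitmap in bitmaps:
        for i, byte in enumerate(bitmap):
            result[i] |= byte
    return result


def changed_blocks(bitmap):
    """Yield the index of every block whose bit is set."""
    for i, byte in enumerate(bitmap):
        for bit in range(8):
            if byte & (1 << bit):
                yield i * 8 + bit


def refresh(source: bytes, live_clone: bytearray, bitmap, block_size: int) -> None:
    """Copy only the changed blocks from the source image into the live clone."""
    for block in changed_blocks(bitmap):
        start = block * block_size
        live_clone[start:start + block_size] = source[start:start + block_size]
```

A block that changed in any intervening backup is copied exactly once, from the most recent source image, so the live clone ends up matching that image.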

FIG. 37 is an exemplary diagram of the prep-mount process for a live clone image which has been previously created (e.g., as described in FIG. 35) or refreshed (e.g., as described in FIGS. 36A and 36B) to scrub the live clone image, according to some embodiments.

The prep-mount operation is different from a traditional mount operation in that the prep-mount operation is used to scrub sensitive data from the live clone backup image 3701 before it is used (e.g., for development and/or testing). In some embodiments, during a prep-mount of a live clone, a reference image 3703 of the live clone is created. The live clone reference image 3703 contains flash-copies of the disks that are contained within the live clone image 3701. The system can use a bitmap to keep track of changes the host 3705 makes to the live clone 3701. Because the reference image 3703 is created from the live clone image 3701 before the live clone image 3701 is mounted to the host 3705 (e.g., and therefore before any changes are made by the host 3705), there is an empty bitmap 3702 associated with the flash-copy mapping between the live clone image 3701 and the reference image 3703.

The host 3705 can modify the contents of the live clone image 3701 once it is mounted, as indicated by arrow 3704, to the specified host 3705. The scrub operation 3706 therefore creates a modified live clone image 3707 that is different from the original live clone image 3701, which is therefore also different than the reference image 3703. These changes are represented by the bitmap 3708 that indicates the changes made by the host 3705. For example, if a production database contains sensitive information like social security numbers, a scrub operation would be necessary, where the liveclone image is mounted to a scrubbed host and scripts run against the mounted image to mask the social security numbers. For example, U.S. Provisional Patent Application No. 61/905,342, entitled “Test-and-Development Workflow Automation,” provides an exemplary use of a liveclone during a workflow automation, which is hereby incorporated by reference herein in its entirety.

FIGS. 38A and 38B are exemplary diagrams illustrating the prep-unmount operation on a live clone image 3802 that has been created (e.g., as described in FIG. 35) or refreshed (e.g., as described in FIGS. 36A and 36B) from a backup image 3801, and that has been prep-mounted 3804 to a host 3805 (e.g., as described in FIG. 37). During the prep-unmount operation the user can choose to either discard (e.g., as shown via arrow 3806 in FIG. 38A) the changes made to the prep-mounted live clone image 3802 or to preserve the changes (e.g., as shown via arrow 3808 in FIG. 38B) made to the live clone image 3802.

Referring to FIG. 38A, if the user decides to discard 3806 the changes made to the prep-mounted live clone image 3802 (e.g., the changes made during the scrub operation), then the bitmap 3807 can be used to reverse the changes. The bitmap 3807 maintains the changes between the live clone image 3802 and the live clone reference image 3803. The bitmap 3807 can be used to generate a list of changed blocks that will be copied, as indicated via arrow 3809, from the disks within the reference image 3803 to the live clone image 3802. Therefore, arrow 3809 indicates a bitmap-based incremental data transfer to restore the contents of the live clone image 3802 after the live clone image 3802 is unmounted from the host 3805. At the end of the copy operation 3809 the live clone reference image 3803 is discarded.

Referring to FIG. 38B, if the user decides to retain the changes made to the live clone image 3802, then the bitmap 3807 that represents the changes between the prep-mounted (e.g., described in FIG. 37) live clone image 3802 and the reference image 3803 is now preserved and associated with the live clone image 3802 and its source image 3801, as indicated via arrow 3810. The bitmap captures data that has been changed since the liveclone image was prep-mounted. This allows the user to “discard” changes made to the live clone. At the end of the prep-unmount operation, the reference live clone image 3803 is discarded.
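The prep-mount and prep-unmount behavior of FIGS. 37, 38A, and 38B can be sketched in Python as follows. The class `PrepMountedClone` is hypothetical: it uses an in-memory copy as a stand-in for the flash-copy reference image and a set of block indices as a stand-in for the bitmap, and it assumes writes stay within the bounds of the image.

```python
class PrepMountedClone:
    """Illustrative model of a prep-mounted live clone (not the actual design)."""

    def __init__(self, image: bytearray, block_size: int):
        self.image = image
        self.block_size = block_size
        self.reference = bytes(image)  # stand-in for the reference image
        self.changed = set()           # stand-in for the initially empty bitmap

    def write(self, offset: int, data: bytes) -> None:
        """Apply a host write and record the blocks it touches."""
        if not data:
            return
        self.image[offset:offset + len(data)] = data
        first = offset // self.block_size
        last = (offset + len(data) - 1) // self.block_size
        self.changed.update(range(first, last + 1))

    def prep_unmount(self, discard: bool) -> set:
        """Discard or retain host changes; return the preserved change record."""
        if discard:
            # Copy only the changed blocks back from the reference image.
            for block in sorted(self.changed):
                start = block * self.block_size
                self.image[start:start + self.block_size] = \
                    self.reference[start:start + self.block_size]
            self.changed.clear()
        preserved = set(self.changed)
        self.reference = None  # the reference image is discarded in both cases
        return preserved
```

In this sketch, discarding costs one copy per changed block rather than a full image copy, which is the point of keeping the bitmap.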

Workflow Automation

Traditional means to provision up-to-date data for development and testing of business applications often involve a lot of manual processes and require coordination from multiple parties with distinct skill sets. It is rare that organizations have efficient test data management software to manage the movement of data in support of their development efforts. Because of this, a typical development project might require 5 to 10 individual copies of a production database and drive 50+ TB of storage requirements. Provisioning 50 TB of copies for test and development can take weeks.

FIG. 39A shows an example workflow of provisioning a copy of production data for test and development. First a developer submits a request for data in step 3950. It takes about 1-2 weeks for management to approve the request in step 3951. It then takes a storage administrator about 1 week to provision required storage resources in step 3952 and a system administrator about 3 days to create backup images in step 3953. A database administrator now takes over, and it takes him or her about 1-3 weeks to clone, refresh and scrub the backup data in step 3954. It then takes about 3 days for the system administrator to rebuild the file system from the cloned and scrubbed backup data in step 3955. The process next goes to management to get approval in about 1 week in step 3956. Finally the developer can begin development work in step 3957. About 5-6 weeks have passed since the original request was submitted until development can begin.

Likewise, because many organizations usually lack test data management software that can easily update a data set without disrupting production systems, teams are often left to use out-of-date data, which can often provoke additional and unnecessary development cycles and “bug” fixes when an application update comes face to face with the reality of current data.

The end result of this traditional process is that development teams are left waiting, less development occurs in a given timeframe, milestones and roadmap dates are missed, application quality suffers, delivery dates are extended and potential revenue-generation is deferred.

The techniques described herein allow a workflow to be defined that specifies a series of automation steps, trigger points, and serial and parallel operations for generating a live copy of a database (e.g., for testing and/or development). The workflow specification can be saved and later run in an automated fashion to generate (or update) the live copy, including removing sensitive data that should not be present in a test or development dataset. The workflow can include optional paths and decision points which determine what to do in certain unexpected situations. At the beginning of the next cycle of the workflow, there is an implicit cleanup of resources left in use by the last cycle.

Automated Test-and-Development Process

FIG. 39B illustrates a flow diagram of a Test-and-Development process leveraging workflow automation technology. The computerized process can reduce complex interactions between different functional groups and significantly expedite availability of production data for development and testing.

A production application (e.g., production application 3900 in FIG. 39C) receives a Service Level Agreement (SLA) in step 3980 from a storage administrator. An SLA describes the data protection characteristics for each stage of the data lifecycle of a business application. Applying the SLA will create snapshot backup images of production data with predefined schedules. For example, U.S. patent application Ser. No. 12/947,385, entitled “System and Method for Managing Data with Service Level Agreements that may Specify Non-Uniform Copying of Data,” describes SLAs, and is incorporated by reference herein in its entirety. A developer (e.g., who is in need of a copy of production data) creates and applies a workflow to the production application in step 3981. A workflow models the underlying data flow for the test-and-development process and defines an automated procedure to drive that data flow. When a workflow service (e.g., Workflow Service 3914 in FIG. 39C) activates the workflow, it starts the operations to clone, refresh and scrub backup images of the production data in step 3982. The cloned and sanitized data is then mounted to all requested test and development applications in step 3983. The fully automated process from step 3981 to step 3983 takes about 12-15 hours.

The disclosed test data management functionality can deliver benefits while enabling organizations to better meet the needs of the development and test teams. For example, some potential benefits can include:

-   -   Providing an instant clone enables the creation of a development sandbox without impacting production or taking substantial resource time to provision.
-   -   Instant mount can be used to rapidly feed data to masking and sub-setting processes.
-   -   A LiveClone enables data to be updated automatically from production in a space efficient manner, allowing development and test teams to work with near real time data over the lifecycle of the project.
-   -   Development and test teams can gain instant access to copies for the development lifecycle without additional license requirements, thereby reducing project costs.
-   -   With more rapid data access timelines, application development project schedules can be accelerated.
-   -   By using up-to-date instances of production data for development, testing, and QA, code quality is improved, rework requirements are reduced, and business acceptance is completed much more quickly.

Test-and-Development Workflow Automation

FIG. 39C illustrates the data flow for the Test-and-Development process leveraging workflow automation technology. FIG. 39C includes production application 3900, sanitization application 3902, and test-and-development application 3904. FIG. 39C also includes data management virtualization engine 3906 and related subcomponents, including storage service 3908, copy data service 3910, LiveClone and Super Scripting Service 3912 and workflow service 3914. FIG. 39C also includes Storage Resources 3916 and all the contained data volumes related to the test-and-development process: primary storage 3918, snapshot 3919, LiveClone 3920 and snapshot 3921.

FIG. 39C illustrates the interactions of Workflow Service 3914 with other components of the Data Management Virtualization Engine 3906 to generate the snapshot(s) 3921 from the Primary Storage 3918 for use by the Test-and-Development Application 3904. For example, U.S. patent application Ser. No. 12/947,385 describes an exemplary Data Management Virtualization system in FIGS. 2-4 and their associated description, which is hereby incorporated by reference herein in its entirety. Production Application 3900 is a customer's deployed business application. For example, a customer may utilize the techniques described herein when it wants to develop and test new applications or new releases of the same application, which are depicted by Test-and-Development Application 3904. Test-and-Development Application 3904 requires testing with real production data owned by Application 3900. However, data owned by application 3900 contains sensitive information, which is not allowed to leave the production environment by legislation or company policies. Examples of such sensitive information can be customers' social security numbers, names, phone numbers or other privacy related matters. It can also be any documents deemed to be critical to the business unit operation that should not leave the production environment in clear text. In order to procure the production data from Application 3900 and make it available for Test-and-Development Application 3904, Sanitization Application 3902 is deployed to “cleanse” the production data, removing or scrambling the sensitive information before passing it on to Test-and-Development Application 3904. Workflow Service 3914 automates and coordinates the data movement and transformation by invoking and coordinating functions and features embodied by Storage Service 3908, Copy Data Service 3910 and LiveClone & Super Scripting Service 3912, which run side by side within a single Data Management Virtualization Engine 3906. This process is explained in more detail below.

Primary Storage 3918 is where application data is stored through its lifecycle. Primary Storage 3918 is mounted to Production Application 3900 through Storage Service 3908, as shown by line 3928. Similarly, storage resource LiveClone 3920 is mounted to Sanitization Application 3902 (as indicated by line 3934) and finally storage resource Snapshot 3921 is mounted to Test-and-Development Application 3904 (as indicated by lines 3936 a-3936 c), all through Storage Service 3908.

Referring to the Workflow Service 3914, the Workflow Service 3914 is configured to execute a workflow (e.g., workflow 4000 a-4000 f in FIG. 40) that is defined to move production data from Application 3900 to Test-and-Development Application 3904 using Sanitization Application 3902 to “scrub” the production data. When the Workflow Service 3914 starts execution of a workflow, Workflow Service 3914 coordinates other data services within the same Data Management Virtualization Engine 3906 (e.g., services from the Storage Service 3908, the Copy Data Service 3910 and the LiveClone & Super Scripting Service 3912) to run each defined workflow item (e.g., WorkflowItem 4104 of FIG. 41). A workflow item models a single step of data transformation of the entire process and is a basic operation unit of the workflow. The end result is a succession of provisioned storage resources, embodied by Snapshot 3919, LiveClone 3920 and Snapshot 3921. The arrows 3930, 3932, and 3938 a-3938 c show the data flow of the production data between provisioned storage resources 3916 by the services of the Data Management Virtualization Engine 3906, which is explained more fully below. Arrows 3930, 3932, and 3938 a-3938 c are dashed to indicate the data flow across the Storage Resources 3916 that is coordinated by the Workflow Service 3914.

Data protection lifecycle requirements of Production Data 3900 stored on Primary Storage 3918 are normally captured by a Service Level Agreement (SLA), which controls, for example, when and how often backups and snapshots are created for data on the Primary Storage 3918. In the example shown in FIG. 39C, when the SLA for Application 3900 is enforced, Copy Data Service 3910 creates Snapshot 3919 from Primary Storage 3918 (this is shown by the line 3930). Snapshot 3919 is a staging data volume, which contains a point in time backup image of the Primary Storage 3918.

When a workflow trigger (e.g., Trigger 4106 of FIG. 41, described in more detail therein) is activated, Workflow Service 3914 starts execution of WorkItem objects defined by the Workflow. The execution of a workflow item invokes its associated work action (e.g., WorkAction 4110 of FIG. 41). For example, a live clone action (e.g., LiveCloneAction 4111 of FIG. 41), which is a subtype of WorkAction, creates storage resource LiveClone 3920 through LiveClone & Super Scripting Service 3912. This is shown via arrow 3932. A liveclone is a staging data volume, which contains a clone from a backup image for the purposes of test and development. Liveclone 3920 can be refreshed incrementally from Snapshot 3919, making these operations inexpensive. Workflow triggers, Workitems and WorkActions are described in more detail with reference to FIG. 41.

Upon creation of LiveClone 3920, Workflow Service 3914 mounts the LiveClone backup image to Sanitization Application 3902 through Storage Service 3908's mount operation, shown via line 3934. Sanitization Application 3902 utilizes LiveClone & Super Scripting Service 3912 to invoke a pre-script before the mount operation and a post-script after the mount operation. In some embodiments, the pre-script prepares the sanitization application before the LiveClone backup image is mounted. For example, the pre-script can shut down the database so that underlying data files can be swapped with the LiveClone backup image. In some embodiments, the post-script contains the application-specific logic to remove or scramble the sensitive information contained in the backup image. When Workflow Service 3914 detects that the script invocations have completed, it calls into LiveClone & Super Scripting Service 3912 and Storage Service 3908 again to unmount LiveClone 3920 from Sanitization Application 3902 while preserving all the changes made by the scripts. The production data is now copied, sanitized and ready for consumption.

The last step of executing a workflow is to mount LiveClone 3920 to each requested Test-and-Development Application 3904. In doing so, Workflow Service 3914 calls into Copy Data Service 3910 to create as many snapshot copies of LiveClone 3920 as requested, producing the resulting Snapshot(s) 3921, as indicated by arrows 3938 a-3938 c. Depending on the configuration, Workflow Service 3914 can mount instances of Snapshot 3921 sequentially or in parallel to each requested Test-and-Development Application 3904, as indicated by lines 3936 a-3936 c (e.g., so each testing and/or development group has its own copy of the data).

The workflow finishes its execution and production data is safely procured, sanitized and consumed by test and development applications. The workflow, once defined, is persisted in Workflow Store 4010 as shown in FIG. 40 and is available for reuse. When one of the workflow's triggers (e.g., Trigger 4106 of FIG. 41, described in more detail therein) is activated again, the entire process described above repeats so that Test-and-Development Application 3904 can start its new cycle of development effort with a refreshed production data set.

Workflow Service

FIG. 40 shows the decomposition of Workflow Service 3914 from FIG. 39C, according to some embodiments. FIG. 40 includes applications 3900 a-3900 c, which interact with the Workflow Service 3914, which in turn interacts with the Workflow Store 4010. In this example, Workflow Service 3914 consists of five main components: Workflows 4000 a-4000 f (collectively referred to herein as “Workflow 4000”), Workflow API 4002, Workflow Scheduler 4004, Workflow Management 4006, and Workflow Monitoring 4008. Lastly, all defined workflow artifacts are persisted into Workflow Store 4010. The Workflow Service 3914 exposes the underlying functionalities through Workflow API 4002, which clients can use to manage the lifecycle of a workflow object. The functionalities exposed through Workflow API 4002 can be, for example, collectively provided by Workflow Scheduler 4004, Workflow Management 4006 and Workflow Monitoring 4008. Workflow Management 4006 is the main component responsible for creating, updating and querying workflow objects (e.g., Workflow 4000) defined within the Data Management Virtualization Engine 3906. Workflow Monitoring 4008 is the component that a client uses to query and monitor the status history of each workflow run. Workflow Scheduler 4004 maintains the schedules for each workflow object. It is the main source for triggering workflow execution. All four components described above use Workflow Store 4010 as the persistent storage to keep track of workflow configuration, states and run history.

The basic operation unit of all five components is Workflow 4000, which captures the abstraction of the data flow for the Test-and-Development process. It is the central data structure that Workflow API 4002 exposes and operates on, which controls data movement as described above in FIG. 39C. Each Application 3900 object can have multiple associated Workflow 4000 objects, with each operating independently from another for different uses of the production data. Test and development is one major use case of Workflow Service 3914, but it can be extended to automate other uses of production data which require multiple steps of transformation.

Workflow Anatomy

FIG. 41 shows the decomposition of Workflow 4000. In general, each Workflow 4000 consists of multiple execution steps (shown as the relationship “* steps” left to WorkItem 4104, in which “*” means multiple here and below), each of which is responsible for transforming the input data in some way and is modeled by WorkItem 4104. Each workflow item is a subtype of WorkGroup 4102 and has an associated WorkAction 4110 (shown as the relationship “action” left to WorkAction 4110). WorkAction 4110 has a set of subtypes embodied in LiveCloneAction 4111, SanitizeAction 4112 and MountAction 4113. Each Workflow 4000 also has a set of associated Triggers 4106 (shown as the relationship “* triggers” left to WorkItem 4104), each of which defines the conditions that, when met, invoke the owning Workflow 4000. Trigger 4106 has a set of subtypes embodied in CronTrigger 4107, ManualTrigger 4108 and EventTrigger 4109.

WorkItem 4104 abstracts an execution step within Workflow 4000. It represents a unit of work, which defines a distinctive phase of transformation of the source data. Arrow 3932 in FIG. 39C exemplifies a work item, “liveclone step”, for the test-and-development process. It takes the snapshot of the production data (Snapshot 3919) and creates or refreshes a LiveClone backup image (LiveClone 3920) from Snapshot 3919. Arrows 3938 a-3938 c in FIG. 39C together exemplify another work item, “mount step”, for the test-and-development process. It takes LiveClone 3920 produced from “liveclone step” and mounts the snapshots of LiveClone 3920 to multiple target test-and-development applications.

WorkItem 4104 is a subtype of another data structure, WorkGroup 4102. WorkGroup 4102 is a folder-like data structure, which can contain instances of itself as child members (shown as the relationship “* children” left to WorkGroup 4102). This essentially enables WorkItem 4104 to model a workflow step which itself consists of child steps (shown as the relationship “* steps” above WorkItem 4104). The parent-child relationship is recursive in nature, but normally each WorkItem 4104 models a single step and contains no more than one level of child WorkItem objects. Each work item decides its succession work item based on the outcome of executing its associated WorkAction 4110 and by configuration. The “onSucess/onFailure” relationship shown below WorkItem 4104 describes the next WorkItem 4104 in line to be executed by Workflow Service 3914. If the outcome of executing the work item's associated WorkAction 4110 is success, then the work item configured for the “onSucess” relationship will be chosen. Otherwise the work item configured for “onFailure” will be selected as the next workflow item for execution.
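The relationships described above can be sketched as a small data model. This is only an illustrative sketch: the class and attribute names (`WorkGroup`, `WorkItem`, `children`, `on_success`, `on_failure`, `next_item`) are hypothetical stand-ins chosen to mirror the figure labels, and the host names are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class WorkGroup:
    """Folder-like node that can contain instances of itself as children."""
    name: str
    children: List["WorkGroup"] = field(default_factory=list)


@dataclass(eq=False)
class WorkItem(WorkGroup):
    """A workflow step; its successors are configured when the workflow is defined."""
    on_success: Optional["WorkItem"] = None
    on_failure: Optional["WorkItem"] = None

    def next_item(self, succeeded: bool) -> Optional["WorkItem"]:
        # Pick the configured successor based on the outcome of the step's action.
        return self.on_success if succeeded else self.on_failure


# A three-step workflow whose last step fans out to two (hypothetical) target hosts.
mount = WorkItem("mount step", children=[WorkItem("dev-host-1"), WorkItem("dev-host-2")])
sanitize = WorkItem("sanitization step", on_success=mount)
liveclone = WorkItem("liveclone step", on_success=sanitize)
```

Here `liveclone.next_item(True)` yields the sanitization step, while `mount.next_item(True)` yields `None`, which corresponds to the end of the workflow.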

WorkItem 4104 has an associated data structure embodied by WorkAction 4110, which abstracts the action that WorkItem 4104 should take if Workflow Service 3914 invokes the step it represents. A WorkAction object defines concrete operations that should be carried out by various services provided by Data Management Virtualization Engine 3906 to achieve the data transformation or movement objective of the owning WorkItem 4104. Such operations include, but are not limited to, snapshotting data volumes, creating or refreshing liveclone volumes, and mounting and unmounting data volumes to and from applications. The operations are grouped into two main methods defined by WorkAction 4110. The “execute” method (shown in the call-out box above WorkAction 4110) groups the operations to carry out the specified data transformation objective, and the “rollback” method (shown in the same call-out box) groups the operations to undo the results of the execute method, if any, should the execute method run into a failure and need to start the recovery process.

At runtime, each WorkItem 4104 is associated with an instance of a concrete subtype of WorkAction 4110 embodied by LiveCloneAction 4111, SanitizeAction 4112 and MountAction 4113. For the test and development process described in FIG. 39C, conceptually the Workflow 4000 object consists of three WorkItem 4104 objects. Workflow Service 3914 executes the WorkItem 4104 objects in succession and the corresponding WorkAction 4110 objects, LiveCloneAction 4111, SanitizeAction 4112 and MountAction 4113, in that order. LiveCloneAction 4111 is responsible for creating and refreshing liveclone backup images. SanitizeAction 4112 is responsible for mounting and “scrubbing” the liveclone backup image to remove sensitive information. Finally, MountAction 4113 is responsible for mounting the sanitized snapshot backup images to the Test-and-Development Application 3904. If multiple target Test-and-Development Application 3904 objects are specified for the last mount step, the parent WorkItem 4104 object representing it will contain multiple child WorkItem 4104 objects, each representing a target Test-and-Development Application 3904 object. The parent or “macro” WorkItem 4104 can choose to carry out the operations of each MountAction 4113 object associated with each child WorkItem 4104 object sequentially, or in parallel if the underlying infrastructure supports it.

Each WorkItem 4104 object decides its succession WorkItem 4104 object based on the operation outcome of its associated WorkAction 4110 object. If the execute method of WorkAction 4110 returns success, the onSuccess method of WorkItem 4104 returns the next WorkItem 4104 object, and that object's WorkAction 4110 execute method is called to keep the workflow rolling forward. If the onSuccess method returns null, it signals the end of the Workflow 4000 invocation. If the execute method of WorkAction 4110 returns failure, the onFailure method of WorkItem 4104 returns the next WorkItem 4104 object, and that object's WorkAction 4110 rollback method is called to start the recovery or unwinding process. The return results of the onSuccess and onFailure methods of each WorkItem 4104 object are configured when the containing Workflow 4000 object is defined. A complete run of Workflow 4000 is a successful traversal of its top-level WorkItem 4104 objects without failure.

As described above, each Workflow 4000 has a set of associated triggers embodied by Trigger 4106. Trigger 4106 specifies the condition under which Workflow Service 3914 should activate Workflow 4000. Trigger 4106 is generally defined according to the source of triggering events, embodied by CronTrigger 4107, ManualTrigger 4108 and EventTrigger 4109. CronTrigger 4107 defines an activation schedule using a “cron-expression”. A cron expression is a string consisting of six or seven subexpressions (fields) that describe individual details of the schedule. One example of a cron-expression is as follows:

Expression          Meaning
“0 0 8 * * ?”       Fire at 8:00am every day
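As a rough illustration of the six-or-seven-field structure, the sketch below splits a cron expression into named fields. The field order (seconds, minutes, hours, day of month, month, day of week, optional year) is an assumption based on the Quartz scheduler convention, which the example expression appears to follow; the function name `split_cron` is hypothetical. The sketch only labels the fields and does not evaluate the schedule.

```python
FIELD_NAMES = ["seconds", "minutes", "hours", "day_of_month",
               "month", "day_of_week", "year"]


def split_cron(expression: str) -> dict:
    """Label the whitespace-separated fields of a six- or seven-field cron expression."""
    fields = expression.split()
    if len(fields) not in (6, 7):
        raise ValueError("expected six or seven fields, got %d" % len(fields))
    # zip stops at the shorter sequence, so the optional year is simply absent
    # when only six fields are given.
    return dict(zip(FIELD_NAMES, fields))


daily_at_eight = split_cron("0 0 8 * * ?")
```

Under this assumed field order, the example expression has seconds `0`, minutes `0` and hours `8`, which is consistent with firing at 8:00am every day.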

ManualTrigger 4108 allows Workflow 4000 to be activated on demand, bypassing all the conditions set up for other types of triggers. EventTrigger 4109 allows Workflow 4000 to be activated in response to certain system events. A typical source of events is SNMP traps that are received by the system.

Defining a Workflow

A workflow can be defined through either a GUI or a command line interface, which should specify all the aspects laid out in the section “Workflow Anatomy”. The core task of defining a workflow is to specify the detailed operations and parameters for each member workflow item. FIG. 44 is an exemplary diagram of a graphical user interface for defining a workflow (e.g., as embodied in FIG. 39C), according to some embodiments.

FIG. 44 is an example defining the “liveclone step” as embodied by arrow 3932 and the “sanitization step” as embodied by arrow 3934 in FIG. 39C. The following descriptions highlight some configuration parameters as shown in the screenshot:

-   -   Workflow Name 4401. The name of the workflow under definition.
    -   LiveClone Settings 4402. This section specifies all the parameters required for creating and refreshing a LiveClone backup image.
        -   Image to Use. Specifies which snapshot of the application is to be used for LiveClone creation.
        -   Refresh/Mount. Specifies the schedule for when to start the workflow, as embodied by CronTrigger 4107.
    -   Mount for pre-processing 4403. This section specifies all the parameters required for removing sensitive information from the LiveClone backup image.
        -   Host. Specifies the target host to carry out the sanitization action, as embodied by Sanitization Application 3902.
        -   Pre-Script/Post-Script. Specifies the pre-script and post-script for SanitizeAction 4112.

FIG. 45 is an example of mounting a live clone to multiple applications. Once the liveclone volume is refreshed and sanitized, it is ready to be mounted to the final development hosts. The same liveclone volume can be mounted to multiple hosts from within the workflow. The following descriptions highlight some configuration parameters as shown in the screenshot:

-   -   Label 4501. All mounted volumes will be tagged with the same label for easy identification.
    -   Candidate hosts 4502. All hosts available for the mount operation are listed in the left panel.
    -   Target hosts 4503. All currently selected hosts for the mount operation are listed in the right panel.
    -   Mount parameters 4504. The set of parameters for the mount operation, which include:
        -   Mount mode. Applicable for VMware virtual machines. vRDM for virtual RDM and pRDM for physical RDM.
        -   Mount Drive. Applicable for Windows hosts. Starting drive letter for the mounted volumes.
        -   Mount Point. Applicable for both Linux and Windows hosts. The starting mount point within the target host's file system.
        -   Pre-script and Post-script. The names of the scripts launched before and after the mount operation.

Detailed Workflow Execution Logic

FIG. 42 illustrates a flowchart of the execution logic of a Workflow when activated by Workflow Service 3914, according to some embodiments.

Step 4200 shows the start of the execution. Step 4202 checks whether a previous activation of the same workflow is still in progress; if so, execution goes to the end, Step 4220. If no previous activation is running, the logic goes to Step 4204, which checks for and frees any system resources left over from a previous activation. Execution then moves to Step 4206 to check whether the pre-condition for executing the workflow is met. If the result is false, workflow execution ends by moving to Step 4220. If the pre-condition check passes, execution moves to Step 4208. Step 4208 finds the starting WorkItem 4104, and execution moves to Step 4210 to instantiate and call the WorkAction 4110 associated with WorkItem 4104. If the outcome of executing WorkAction 4110 is success, Step 4214 finds the next WorkItem 4104 by calling the current WorkItem 4104's onSuccess method. If the returned result is not null, signaling that there are more WorkItem 4104 objects to be executed, the execution logic returns to Step 4208. The loop between Step 4208 and Step 4216 repeats until Step 4216 signals that all WorkItem 4104 objects have been exhausted. Execution then moves to Step 4218 to check whether the post-condition for executing the workflow holds. If it does, execution moves to Step 4220 and ends. If the post-condition does not hold, execution moves to Step 4232 to report failure and then ends at Step 4220.

In Step 4212, if the outcome of executing the current WorkItem 4104 is failure, execution moves to Step 4222 to find the next WorkItem 4104 by calling the current WorkItem 4104's onFailure method. If the returned result is not null, execution moves to Step 4224 to start the rollback process. Execution moves to Step 4226 to instantiate the WorkAction 4110 associated with the WorkItem 4104 identified in Step 4222 and call its rollback method. Execution moves to Step 4228 to call the onFailure method of the WorkItem 4104 identified in Step 4224 to find the next WorkItem 4104. It then moves to Step 4230 to check whether the returned result is null. If the result is not null, it loops back to Step 4224 and repeats the rollback process. If the returned result is null, this signals the end of the rollback process, and execution moves to Step 4232 to report the failure and then to Step 4220 to finish the execution of the entire workflow.
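The forward and rollback loops of FIG. 42 can be sketched roughly as follows. This is a simplified sketch under stated assumptions, not the actual implementation: `Step`, `run_workflow` and `make_step` are hypothetical names, the in-progress check (Step 4202) and resource cleanup (Step 4204) are omitted, and failure is reported simply by returning `False`.

```python
from typing import Callable, List, Optional


class Step:
    """Hypothetical work item: an action plus successors fixed when the workflow is defined."""

    def __init__(self, name: str, execute: Callable[[], bool], rollback: Callable[[], None]):
        self.name = name
        self.execute = execute        # returns True on success, False on failure
        self.rollback = rollback      # undoes the effects of this step
        self.on_success: Optional["Step"] = None
        self.on_failure: Optional["Step"] = None


def run_workflow(start: Step,
                 precondition: Callable[[], bool] = lambda: True,
                 postcondition: Callable[[], bool] = lambda: True) -> bool:
    """Return True for a complete run, False if any check or step fails."""
    if not precondition():                 # Step 4206
        return False
    item: Optional[Step] = start           # Step 4208
    while item is not None:
        if item.execute():                 # Steps 4210 and 4212
            item = item.on_success         # Step 4214; None ends the loop (Step 4216)
            continue
        undo = item.on_failure             # Step 4222
        while undo is not None:            # Steps 4224 to 4230
            undo.rollback()
            undo = undo.on_failure
        return False                       # Step 4232: report failure
    return postcondition()                 # Step 4218


log: List[str] = []


def make_step(name: str, succeeds: bool) -> Step:
    def execute() -> bool:
        log.append("execute " + name)
        return succeeds

    def rollback() -> None:
        log.append("rollback " + name)

    return Step(name, execute, rollback)


liveclone = make_step("liveclone", True)
sanitize = make_step("sanitize", True)
mount = make_step("mount", False)          # simulate a failed mount
liveclone.on_success, sanitize.on_success = sanitize, mount
mount.on_failure, sanitize.on_failure = sanitize, liveclone

completed = run_workflow(liveclone)
```

In this example the mount step fails, so the sketch unwinds the sanitize and liveclone steps in that order by following the configured onFailure chain.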

FIG. 43 is a flowchart which shows the execution of WorkItem 4104 and gives a magnified view of Step 4210 in FIG. 42. In this example, WorkItem 4104 models a step of the Test-and-Development process, and each Workflow 4000 can consist of multiple WorkItem 4104 objects to model the entire Test-and-Development process. Execution starts at Step 4300. It moves to Step 4302 to check whether WorkItem 4104 contains any child WorkItems. If the result is yes, it moves to Step 4304 to check whether the child WorkItems should be executed sequentially or in parallel. If the check indicates parallel execution, execution moves to Step 4306 to enumerate all child WorkItems. It then moves to Step 4308 to instantiate WorkAction 4110 for each child WorkItem 4104 and call its execute method in parallel. Execution then moves to Step 4310 to combine the results from the execution of each child WorkItem 4104. Execution then moves to Step 4312 to execute logic specific to the WorkItem 4104 itself, and moves to Step 4314 to finish the execution of WorkItem 4104 and return the final result.

If the check result is false at Step 4304, execution moves to Step 4316 to enumerate all child WorkItems. For each child WorkItem 4104, execution moves to Step 4318 to instantiate and call its associated WorkAction 4110. It then moves to Step 4320 to check whether any child WorkItem 4104 remains unexecuted, and loops back to Step 4316 if that is the case. If no child WorkItem 4104 remains unexecuted, control moves to Step 4310.
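A minimal sketch of the FIG. 43 logic, with children run either sequentially or in parallel, might look like the following. The names `Item`, `execute_item` and `mount_to` are hypothetical, a thread pool stands in for whatever parallelism the infrastructure provides, and the rule for combining child results (all children must succeed) is an assumption, since the text does not specify it.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


class Item:
    """Hypothetical work item with an action and optional child items."""

    def __init__(self, name: str, action: Callable[[], bool],
                 children: Optional[List["Item"]] = None, parallel: bool = False):
        self.name = name
        self.action = action
        self.children = children or []
        self.parallel = parallel


def execute_item(item: Item) -> bool:
    results: List[bool] = []
    if item.children:                                   # Step 4302
        if item.parallel:                               # Step 4304
            with ThreadPoolExecutor() as pool:          # Steps 4306 and 4308
                results = list(pool.map(execute_item, item.children))
        else:                                           # Steps 4316 to 4320
            results = [execute_item(child) for child in item.children]
    own_result = item.action()                          # Step 4312
    return all(results) and own_result                  # Steps 4310 and 4314 (assumed rule)


mounted: List[str] = []


def mount_to(host: str) -> Callable[[], bool]:
    def action() -> bool:
        mounted.append(host)       # stand-in for a real mount operation
        return True
    return action


hosts = ["dev-1", "dev-2", "qa-1"]
parent = Item("mount step", lambda: True,
              children=[Item(h, mount_to(h)) for h in hosts], parallel=True)
ok = execute_item(parent)
```

Setting `parallel=False` on the parent runs the same children one after another, which corresponds to the sequential branch through Steps 4316 to 4320.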

Data Management Virtualization

As the value of data has increased, and the acquisition cost per gigabyte of data has dropped, enterprises have been deploying larger and larger storage systems. This has particularly been the case with unstructured data, which is usually stored in large file systems, very often accessed over a network by multiple servers using industry standard network-attached storage (NAS) protocols such as Network File System (NFS) and Common Internet File System (CIFS).

While the capacity of these NAS devices has gone up, the data protection capabilities, namely backup and restore and replication, have not kept pace. This has, for example, led enterprise users to cobble together inefficient and expensive solutions for protecting their data, or go without complete protection.

The techniques described herein protect and manage the lifecycle of data in large Network Attached Storage (NAS) deployments with high efficiency and with the ability to scale with the growth of the NAS System. In addition to data protection, the system described here can leverage the data within the NAS System for additional purposes such as test & development, analysis, reporting, e-discovery and similar functions. The techniques described herein can also protect and manage the lifecycle of data in big data systems (e.g., Hadoop, MongoDB).

Since NAS is, by its nature, often remotely located from a backup device, NAS must usually be mounted to a host prior to performing backups. The techniques described herein provide for a NAS backup proxy that, in some embodiments, is external to the copy data management server. A NAS server can include a management component that provides an API for invoking functions on the NAS server, such as a snapshot function that generates a snapshot of the NAS server and/or a change tracking function that tracks changes made to data stored on the NAS server. By using such NAS functionality, the backup process can be very fast and efficient since only changed data (e.g., deletions, modifications) are copied from the NAS server.

By remotely locating the NAS backup proxy from the copy data management server, NAS backup proxies can be added for scalability such that more NAS backup proxies can be added as the data in the NAS storage grows (e.g., while still using a single virtual data pipeline to the copy data management server). Additionally, as described herein, the NAS backup proxy can be selected such that it is compatible with the NAS server (e.g., NFS for Unix/Linux and CIFS for Windows).

Large enterprise computing systems today often include large amounts of both structured and unstructured data.

Structured data is characterized by having a well-defined format, with large numbers of similar items, each item of data having relationships with other items of data. Such data is most often stored in databases, such as relational databases, object databases, and even specialized databases such as email repositories. As such storage repositories have evolved and grown in size, mechanisms have developed for protecting and replicating the data within these repositories. The relational database system developed by Oracle Corporation, for example, includes tools such as the Oracle Recovery Manager (RMAN) and others, developed by Oracle and by third parties, that enable end users to manage the life cycle of their structured data.

Unstructured data is characterized by being heterogeneous and not having a well-defined form, with larger individual items, each with its own metadata. An example of unstructured data is a large collection of text files, documents, spreadsheets, images, audio and video files. Each file may be sizable, and has metadata such as a filename, file owners, dates of creation and modification, and other attributes. Unstructured data is often stored in file systems, and when shared access to such a repository is required, these file systems are shared over a network in what is described as Network Attached Storage (NAS).

A NAS system can be designed to hold unstructured data, which is made accessible to multiple host computers using a well-defined file-access protocol such as CIFS or NFS. Such a system is often designed to scale to large or very large sizes, from tens of terabytes to several petabytes. A large NAS deployment may hold tens or hundreds of millions of individual files, and may be accessed by thousands of computer systems at the same time.

Most modern NAS servers include the ability to take snapshots of the file system state, and often include an interface by which one can determine what files have changed between one snapshot and another.

Whereas the life cycle management of structured data has generally kept pace with the growth of structured data, the same cannot be said for NAS Servers. When NAS Servers get to multiple terabytes in size, it becomes impossible to back them up with conventional backup tools. Backups take too long, and impose too much of a load on the production systems. The NAS vendors only offer replication, where the data in a NAS System can be sent across a wire to a similar NAS system at a remote location. This technique is an expensive solution that addresses site failure, but does not address operational data loss.

The techniques described herein include a system that can efficiently protect a large NAS system, and can grow with the NAS system. It takes advantage of the snapshotting capabilities within the NAS system, and the ability to identify changed files between snapshots.

A Copy Data Management system can be enhanced with the addition of one or more NAS Backup Proxy hosts that serve to back up some or all of one or more NAS Servers. Multiple Proxy hosts can be added to a single Copy Data Management server to scale with the growth of the NAS Server. For example, U.S. patent application Ser. No. 13/920,981, entitled “System and Method for Incrementally Backing Up Out-of-Band Data”, which is hereby incorporated by reference herein in its entirety, describes an example of a virtualized data management system.

FIG. 46 illustrates the relationship between the Copy Data Management System and the rest of the enterprise systems, according to some embodiments. The Customer environment consists of a collection of physical and virtual machines, 4600. These are protected by the Copy Data Management Server 4605 using storage in the form of Copy Data Storage Pools 4606. Copy Data Storage Pools 4606 are, for example, storage that the customer has specifically reserved for storing copies of production data. The enterprise environment also includes one or more NAS Servers 4601 and 4602, which consist of NAS service nodes with their own storage 4601 a-4601 c and 4602 a-4602 c, respectively. NAS Servers 4601 and 4602 are exemplified by offerings from EMC, Network Appliance, and other NAS vendors. The Copy Data Management System has been enhanced with the addition of one or more NAS Backup Proxy Servers, 4603 and 4604. Two types of NAS Backup Proxies are illustrated: one with its own snapshot and deduplicated storage, 4603, and the other without its own snapshot and deduplicated storage, 4604. NAS Backup Proxy Server 4604 is configured without its own storage. This type of NAS Backup Proxy Server 4604 can share storage with the Copy Data Management Server 4605. NAS Backup Proxy Server 4603 is configured with its own Copy Data Storage Pool 4603 a. Having its own storage can be useful if the NAS Backup Proxy Server 4603 is at a distant location, where it would be impractical to share the Copy Data Storage Pool 4606 with the Copy Data Management Server 4605. The user may also choose to deploy NAS Backup Proxy Servers 4603 with their own storage for other reasons, such as keeping storage reserved for NAS protection, or the ability to expand the Copy Data Storage as more and more NAS Servers are protected. Note that there is not necessarily a one-to-one relationship between the NAS Servers and the NAS Backup Proxy hosts. There may be more or fewer NAS Backup Proxy hosts depending on the size and capacity of the NAS Servers and of the proxy hosts. For example, one NAS Backup Proxy may protect several NAS Servers, and a single NAS Server may be protected by several NAS Backup Proxy hosts.

FIG. 47 illustrates the high level components that are active during the backup and mounting of NAS systems, according to some embodiments. The file system presented by the NAS Server 4701 is mounted by the NAS Backup Proxy host 4710 using standard protocols such as NFS and CIFS, or a proprietary protocol if required by the NAS vendor. Backups can be mounted as either CIFS or NFS regardless of whether the original NAS export was one or the other. The NAS Backup Proxy Server 4710 includes various services including the Proxy Copy Service 4703, the Search and Indexing Service 4704, the Mount Service 4705, the Management Services 4706 and other services such as Compliance 4720. There may also be a Filesystem Snapshot Service 4711, which is usually used when the NAS Backup Proxy host has its own snapshot and deduplicated storage. The Virtual Volumes 4709 are created from the Copy Data Storage Pools 4708 by the Orchestration Engine 4713 and presented to the NAS Backup Proxy hosts 4710. The Orchestration Engine 4713 can communicate with various components on the NAS Backup Proxy 4710, as explained further with reference to FIGS. 48, 49, and 50.

The Copy Data Management Server 4707 includes the Orchestration Engine 4713 and the Virtual Disk Snapshot Service 4712. The Virtual Disk Snapshot Service 4712 is usually used when the NAS Backup Proxy 4710 does not have its own storage. The Orchestration Engine 4713 communicates with the NAS Management component 4714 to issue commands to create and delete NAS Server 4701 snapshots and to compare snapshots to generate lists of modified files.

The Copy Data Management Server 4707 controls the Copy Data Storage Pools 4708, and may apportion some of the storage in the pools to one of the NAS Backup Proxy hosts in the form of the Virtual Volumes 4709.

The NAS Backup Proxy Host 4710 contains many Services that are used within the Copy Data Management System 4707 to perform protection and recovery operations.

In some embodiments, the Proxy Copy Service 4703 is responsible for mounting a snapshot of a NAS Server 4701 onto the NAS Backup Proxy Host 4710, formatting Virtual Volumes 4709, creating filesystems on the virtual volumes, and then copying all files or changed files only from the mounted NAS snapshot to the filesystem. The Proxy Copy Service 4703 can also be responsible for creating a list of files copied along with their metadata, and communicating with the Search and Index Service 4704 to generate an index of the files backed up.

The Search and Index Service 4704 is, for example, a general purpose search engine which is capable of breaking up its inputs into words and generating an index database that can quickly be searched to find occurrences of single terms or more complex queries. This Service can be used on the NAS Backup Proxy Host to generate an index of the files backed up, and to be able to search this index to locate the backup that needs to be mounted to restore a particular file.

The Mount Service 4705 can be responsible for importing Virtual Volumes 4709 and mounting the filesystems on these volumes. It can also export the mounted filesystems to other hosts in the Enterprise.

The Compliance and Other Services 4720 is, in some examples, a set of services that may optionally be deployed on the NAS Backup Proxy 4710 host to perform advanced operations such as compliance auditing, e-discovery or long term archiving. These services can be capable of performing their selected actions on a mounted copy of a NAS filesystem backup. In some examples, an advantage of deploying these services on the NAS Backup Proxy Host 4710 is that they have no impact on the NAS Server and that the services scale up; that is, as more NAS Servers are deployed, more NAS Backup Proxy Servers can be deployed to keep pace.

Configuration of the Dataset

To pre-configure a NAS system for backup, the user can use the graphical user interface (GUI) of the Copy Management System (not illustrated) and specify the NAS Server through its IP address or URL, and select one or more of the NAS Backup Proxy hosts. The NAS Proxy Hosts will then mount the NAS Filesystems 4702 and be ready for browsing.

To configure a subset of a NAS Filesystem for backup, the user can use the GUI to browse the mounted filesystem, and select the starting directories to be backed up. This is called the NAS Backup Dataset. The user can also select a Service Level Template, which specifies the backup frequencies and retentions across the various Copy Data Storage Pools.

Flow of the First Backup

FIG. 48 is the sequence diagram illustrating the workflow of the first-time data capture of the NAS system, according to some embodiments.

When the Copy Management System schedules the first backup of the NAS Server as dictated by the Service Level Policies set by the user, it follows a computerized process shown by the sequence diagram in FIG. 48, according to some embodiments.

The Orchestration Engine 4713 on the Copy Data Management Server 4707 reads the NAS Backup Dataset configuration parameters by sending commands and queries to the NAS Server 4715. From this, the Orchestration Engine 4713 can derive which NAS Server is to be backed up, what subset of the filesystem is to be backed up, and which NAS Backup Proxy host will participate.

The Orchestration Engine 4713 creates an appropriately sized staging virtual disk for the backup, and presents it to the correct NAS Backup Proxy 4710. It then communicates with the Copy Service 4703 on the NAS Backup Proxy 4710.

The Copy Service 4703 ensures that the NAS Filesystem is still mounted. It then formats the virtual disk and creates a target filesystem on the virtual disk. The type of filesystem created depends on the supported filesystem types and the type of NAS that is being backed up. To properly back up a CIFS-based NAS Filesystem, the NAS Backup Proxy host will usually be a Windows host, and will use an NTFS filesystem. For an NFS-based NAS Filesystem, a Linux-based NAS Backup Proxy host is preferred, and an ext3 filesystem is usually deployed.

Next, either the Orchestration Engine 4713 or the Copy Service 4703 will create a NAS Filesystem snapshot, and will then copy data from the NAS snapshot to the target filesystem. Since this is the first backup of the particular dataset, all of the files matching the dataset criteria will have to be copied to the target file system.

While copying the files, the Copy Service 4703 will generate a list of the files copied including selected metadata. The metadata will include the pathname of the file, the dates of creation and modification, the owner, permissions and potentially other attributes. When backups are mounted as CIFS, there is a choice of share level permissions or file level permissions. Share level permissions grant access to a network node associated with a share point, while file level permissions grant access to individual objects such as files and folders.
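
As a minimal sketch of copying files while building the metadata list, assuming both the NAS snapshot and the target filesystem are reachable as ordinary directory paths on the proxy host, the following uses only the Python standard library. The function name copy_tree_with_metadata and the record fields are hypothetical; a creation date is omitted because it is not portably available through os.stat.

```python
import os
import shutil


def copy_tree_with_metadata(source_root, target_root):
    """Copy every regular file and return one metadata record per file."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        rel_dir = os.path.relpath(dirpath, source_root)
        dest_dir = os.path.normpath(os.path.join(target_root, rel_dir))
        os.makedirs(dest_dir, exist_ok=True)
        for name in sorted(filenames):
            src = os.path.join(dirpath, name)
            # copy2 copies the data and preserves timestamps where possible.
            shutil.copy2(src, os.path.join(dest_dir, name))
            info = os.stat(src)
            records.append({
                "path": os.path.normpath(os.path.join(rel_dir, name)),
                "modified": info.st_mtime,
                "owner_uid": info.st_uid,
                "mode": oct(info.st_mode & 0o777),
                "size": info.st_size,
            })
    return records
```

The returned list plays the role of the metadata list that is later handed to the indexing step.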

At the end of the copy process, the Copy Service 4703 will pass the metadata list to the Search and Indexing Service 4704 in order to generate an index of the files backed up.

Next, the Orchestration Engine 4713 or the Copy Service 4703 will invoke a snapshot operation on the virtual disk. If the virtual disk snapshot service is being used, the Engine will invoke the Virtual Disk Snapshot Service 4712 on the Copy Management Server. If the Filesystem Snapshot Service 4711 is to be invoked, the Copy Service 4703 shall invoke the snapshot on the NAS Backup Proxy Host 4710.

Therefore, in some embodiments, when the NAS Server 4709 generates a snapshot, the snapshot is copied from the NAS Server 4709 to the mounted Virtual Volumes 4709. The Orchestration Engine 4713 takes a snapshot of the Virtual Volumes 4709 (e.g., using one of two file system-based snapshot services 4711 or 4712). The NAS snapshot stored on the NAS Server 4701 is not deleted because the process needs the snapshot for the next backup to compare it against the new snapshot to identify the changed data so only changed data needs to be transmitted to the Virtual Volumes 4709. For example, the NAS Server 4701 can use change tracking to only copy changed (deleted) files. In some examples, the Virtual Volume 4709 is a live version; the snapshot of the Virtual Volumes 4709 is taken to back up the live version, which can be retained for as long as the system is configured to retain the snapshots (e.g., daily, weekly, etc.).

The last step of the backup is to catalog the backup, which records details of the backup that has been completed, including the date and time, and the copy data storage that was used. The target filesystem is now unmounted from the NAS Backup Proxy host, and the virtual disk unmapped from it.

Flow of Subsequent Backups

FIG. 49 is the sequence diagram for the flow of any subsequent backup of a dataset after the first one, according to some embodiments.

The backup begins with the Orchestration Engine 4713 reading the dataset configuration and information from the catalog indicating the most recent backup. From the catalog, the Orchestration Engine 4713 learns of the previously used staging virtual disk. The Orchestration Engine maps this virtual disk to the appropriate NAS Backup Proxy 4710 host. It then sends a message to the Copy Service 4703 on the NAS Backup Proxy 4710 host.
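
One possible way to picture the catalog lookup is sketched below, assuming the catalog is a simple list of records. The CatalogEntry fields and the latest_backup helper are hypothetical names chosen for illustration; the actual catalog layout is not specified here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class CatalogEntry:
    dataset: str            # name of the NAS Backup Dataset (hypothetical field)
    completed_at: datetime  # date and time the backup completed
    staging_disk: str       # staging virtual disk used for the backup
    snapshot_name: str      # snapshot of the staging disk for this backup


def latest_backup(catalog: Iterable[CatalogEntry],
                  dataset: str) -> Optional[CatalogEntry]:
    """Return the most recent cataloged backup of a dataset, if any."""
    entries = [e for e in catalog if e.dataset == dataset]
    return max(entries, key=lambda e: e.completed_at, default=None)
```

The staging_disk of the returned entry identifies the virtual disk to map to the proxy host again; a result of None would indicate that no prior backup exists and a first backup is needed.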

The Copy Service 4703 mounts the target filesystem, and then creates a new NAS snapshot. It uses the Management Service 4706 of the NAS System to compare the current snapshot and the previous one, and generates a list of files that were created, modified or deleted since the last backup.

The Copy Service 4703 then copies newly created and modified files from the mounted NAS filesystem to the target filesystem, and it also deletes files from the target filesystem if they were deleted from the NAS Filesystem. At the end of this operation, the target filesystem looks just like the NAS filesystem. While copying, the Copy Service 4703 creates a list of the files that it handled, along with selected metadata.
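
The comparison and update steps can be sketched as follows. In the described system the snapshot comparison is performed by the NAS Management Service; the in-memory comparison here is only a stand-in, and each snapshot is modeled as a hypothetical mapping from file path to file contents.

```python
def diff_snapshots(previous, current):
    """Compare two snapshot listings (path -> contents)."""
    created = sorted(p for p in current if p not in previous)
    deleted = sorted(p for p in previous if p not in current)
    modified = sorted(p for p in current
                      if p in previous and current[p] != previous[p])
    return created, modified, deleted


def apply_changes(target, source, created, modified, deleted):
    """Bring the target filesystem (path -> contents) in line with the source."""
    for path in created + modified:
        target[path] = source[path]   # copy only new or changed files
    for path in deleted:
        target.pop(path, None)        # mirror deletions on the target
    return len(created) + len(modified)  # number of files actually copied


previous = {"a.txt": "v1", "b.txt": "v1", "c.txt": "v1"}
current = {"a.txt": "v1", "b.txt": "v2", "d.txt": "v1"}
target = dict(previous)  # staging disk as left by the prior backup
created, modified, deleted = diff_snapshots(previous, current)
copied = apply_changes(target, current, created, modified, deleted)
```

In this example only two files are copied (b.txt and d.txt) and one is deleted (c.txt), yet the target ends up identical to the current snapshot.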

Once the copying is completed, the Copy Service 4703 deletes the older NAS snapshot. It then invokes the Search and Indexing Service 4704 on the list of files created, copied or deleted.

The Search and Index Service processes the list of files provided to it and adds these file names and their metadata to the search database that it maintains. This provides it with the ability to perform fast searches on any of the filenames or other metadata, and identify the backup that contained these files.
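
A minimal sketch of such a search database, assuming each metadata record carries a "path" field and each backup has an identifier, is shown below. The BackupIndex class is hypothetical, and shell-style wildcard matching via the standard fnmatch module is an assumption about how a query might be expressed.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


class BackupIndex:
    """Maps file paths to the backups that contain them."""

    def __init__(self):
        self._by_path = defaultdict(set)  # path -> set of backup identifiers

    def add_backup(self, backup_id, records):
        """Index the metadata records produced while copying one backup."""
        for record in records:
            self._by_path[record["path"]].add(backup_id)

    def search(self, pattern):
        """Return the sorted backups containing a path matching the pattern."""
        hits = set()
        for path, backups in self._by_path.items():
            if fnmatchcase(path, pattern):
                hits |= backups
        return sorted(hits)
```

A production index would likely use a real search engine or database; the linear scan here keeps the sketch short.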

Next the Orchestration Engine 4713 invokes the Virtual Disk Snapshot Service 4712 (or the Copy Service invokes the Filesystem Snapshot Service 4711), to create a new snapshot of the staging disk. This disk is cataloged as the next successful backup, and the target filesystem is unmounted and the virtual disk is unmapped.

Therefore, only the changed data is copied to the mounted Virtual Volumes 4709, which updates the live copy on the Virtual Volumes 4709. The Orchestration Engine 4713 takes a snapshot of the Virtual Volumes 4709 to back up the current version of the live copy on the Virtual Volumes 4709. Since the snapshots of the Virtual Volumes 4709 can be retained for as long as the system is configured to retain the snapshots (e.g., daily, weekly, etc.), this may result in multiple snapshots on the Virtual Volumes 4709.
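
How a configured retention period might determine which staging-disk snapshots are kept can be sketched as below. The function split_by_retention and the single fixed retention window are assumptions for illustration; actual retention rules are set by the Service Level Template and may be more elaborate.

```python
from datetime import datetime, timedelta


def split_by_retention(snapshot_times, now, retention):
    """Partition snapshot timestamps into those to keep and those to expire."""
    keep = [t for t in snapshot_times if now - t <= retention]
    expire = [t for t in snapshot_times if now - t > retention]
    return keep, expire


now = datetime(2014, 11, 18)
times = [now - timedelta(days=d) for d in (0, 3, 10)]
keep, expire = split_by_retention(times, now, timedelta(days=7))
```

With a seven-day window, the snapshots taken zero and three days ago are kept and the ten-day-old snapshot is eligible for expiry.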

The second and each subsequent backup performed in this manner is a full copy of the NAS filesystem. Every file in the subset of the NAS System is on the target filesystem at the same version. Yet the full backup was achieved by just copying the changed files from the NAS Filesystem, which is an enormous savings in I/O. This is what allows the Copy Data Management system to handle very large NAS systems.

Mounting of a NAS Backup

FIG. 50 shows a sequence diagram for the flow of mounting and unmounting a NAS Backup to a customer system for a restore, according to some embodiments. In some embodiments, the NAS Backup Proxy 4710 may not perform a traditional restore (e.g., that returns data back to the NAS server 4701). The NAS Backup Proxy 4710 can be configured to mount a snapshot to the target device such that the target device can see (e.g., and manipulate) the full data on the mounted snapshot without affecting or overwriting data on the NAS Server 4701.

The mount and unmount operations can replace the restore in a traditional backup. Mounting of a NAS Backup allows the user to access files as they used to be at the time of the backup. It is quicker than a traditional restore, because no data movement is involved. The time required is virtually independent of the size of the backup.

The mount operation can be triggered, for example, by the user using a GUI to invoke the service. The Orchestration Engine 4713 presents a selection list based on the filesystems that were backed up. The user may select one or more datasets, and type in keywords to identify the desired dataset(s). Keywords may include filenames or wildcard patterns, or owner names or any other indexed attributes.

The Orchestration Engine 4713 presents these keywords to the Indexing and Search Service 4704. The Search Service 4704 returns with a list of backups that matched the search query. Now the user selects one of the backups, and the host to which the backup is to be mounted.

The Orchestration Engine 4713 converts the search engine results into the name of a virtual disk (or filesystem) snapshot. If required, the Orchestration Engine 4713 creates a writable clone from the snapshot, and presents this clone to the NAS Backup Proxy host. For example, while some snapshot functions allow the data to be modified, some snapshot functions require making a clone of the snapshot before it is writable.

There, the Mount Service 4703 mounts the filesystem from the virtual disk, and then exports this as a NAS Share to the user selected host.

Depending on the level of access available on the user selected host, the Orchestration Engine 4713 or the user will mount the share on the selected host, and the user will have full access to the backed up data.

The user can now examine the files in the mounted filesystem on the selected host. The user can copy files, run programs or even make modifications to the mounted files. The mounted filesystem is based on a writable clone of the original snapshot, so the snapshot is unaffected by modifications.
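
The isolation between a writable clone and its snapshot can be illustrated with a simple overlay, assuming the snapshot is modeled as a mapping from path to contents. The WritableClone class is a hypothetical stand-in; the disclosed clone operates on virtual disks or filesystems, not on Python dictionaries.

```python
from types import MappingProxyType


class WritableClone:
    """Overlay that records writes without touching the base snapshot."""

    def __init__(self, snapshot):
        # A read-only view guards against accidental writes to the snapshot.
        self._snapshot = MappingProxyType(snapshot)
        self._changes = {}  # path -> contents written through the clone

    def read(self, path):
        """Return the clone's view of a file: modified data wins."""
        if path in self._changes:
            return self._changes[path]
        return self._snapshot[path]

    def write(self, path, data):
        """Record a modification in the clone only."""
        self._changes[path] = data


snapshot = {"report.txt": "as backed up"}
clone = WritableClone(snapshot)
clone.write("report.txt", "edited after mount")
```

After the write, the clone returns the edited contents while the snapshot still holds the contents as backed up, so destroying the clone discards the edits without affecting the backup.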

After the user is done with the mounted filesystem, the user invokes the unmount operation. The filesystem is unmounted from the selected host. Then the virtual disk snapshot is unmounted from the NAS Backup Proxy, and the virtual disk is unmapped. Last of all, the writable clone is destroyed.
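
The unmount sequence undoes the mount steps in reverse order. A minimal sketch of that ordering, using the standard contextlib.ExitStack to run teardown callbacks last-in first-out, is given below. The step names are descriptive strings only, and mount_session is a hypothetical helper that logs steps rather than performing them.

```python
from contextlib import ExitStack


def mount_session(log):
    """Log each mount step and register its matching teardown step."""
    stack = ExitStack()
    steps = [
        ("create writable clone", "destroy writable clone"),
        ("map virtual disk to proxy", "unmap virtual disk"),
        ("mount filesystem on proxy", "unmount filesystem from proxy"),
        ("mount share on selected host", "unmount share from selected host"),
    ]
    for setup, teardown in steps:
        log.append(setup)
        # Callbacks run in reverse registration order when the stack exits.
        stack.callback(log.append, teardown)
    return stack


log = []
with mount_session(log):
    log.append("user examines files")
```

When the with block exits, the log shows the share unmounted first and the writable clone destroyed last, matching the order described above.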

Comparing Virtual Disk Snapshots and Filesystem Snapshots

The Copy Data Management System can be configured to support multiple (e.g., two) different snapshot mechanisms for creating snapshots of the staging disk after the copying of data is completed.

In some embodiments, one mechanism is the Virtual Disk Snapshot Service that runs on the Copy Data Server 4712 as shown in FIG. 48. This service is capable of creating snapshots of virtual disks using a storage hypervisor and a copy-on-write technology.

The other mechanism is the Filesystem Snapshot Service 4711, as shown in FIG. 47. This service runs on the NAS Backup Proxy host. This service uses a filesystem based snapshot capability built on allocate-on-write technology.
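
The general difference between the two techniques can be sketched at block level. The two classes below are generic textbook-style illustrations under simplifying assumptions (in-memory lists of blocks, no space reclamation); they are not the implementation of either disclosed service. With copy-on-write, the old block is preserved for the snapshot before the live block is overwritten in place. With allocate-on-write, new data goes to a freshly allocated block and the snapshot keeps pointing at the old one.

```python
class CopyOnWriteVolume:
    def __init__(self, blocks):
        self.blocks = list(blocks)  # live data, overwritten in place
        self.snapshots = []         # per snapshot: block index -> old data

    def snapshot(self):
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, index, data):
        # Preserve the old block once for the newest snapshot, then overwrite.
        if self.snapshots and index not in self.snapshots[-1]:
            self.snapshots[-1][index] = self.blocks[index]
        self.blocks[index] = data

    def read(self, index):
        return self.blocks[index]

    def read_snapshot(self, snap_id, index):
        # The first preserved copy at or after this snapshot is its view.
        for saved in self.snapshots[snap_id:]:
            if index in saved:
                return saved[index]
        return self.blocks[index]


class AllocateOnWriteVolume:
    def __init__(self, blocks):
        self.store = list(blocks)                 # append-only block store
        self.live_map = list(range(len(blocks)))  # logical index -> store slot
        self.snapshots = []                       # frozen copies of the map

    def snapshot(self):
        self.snapshots.append(list(self.live_map))
        return len(self.snapshots) - 1

    def write(self, index, data):
        self.store.append(data)  # allocate a new slot; old data is untouched
        self.live_map[index] = len(self.store) - 1

    def read(self, index):
        return self.store[self.live_map[index]]

    def read_snapshot(self, snap_id, index):
        return self.store[self.snapshots[snap_id][index]]
```

Both classes present the same snapshot views to a reader; they differ in where the extra work occurs, which is one reason the performance and scaling characteristics of the two services can differ.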

One or both of these services may be available in any particular configuration. If both are available, which one is used depends on the exact requirements of the user.

FIG. 51 is an exemplary table that compares features of the two Snapshot services, according to some embodiments. In some embodiments, the Virtual Disk snapshot service provides for higher performance snapshots than the Filesystem Snapshot service. However, in some implementations the Filesystem Snapshot service scales better, since each NAS Backup Proxy will have its own instance of the Filesystem Snapshot service.

The subject matter described herein can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structural means disclosed in this specification and structural equivalents thereof, or in combinations of them. The subject matter described herein can be implemented as one or more computer program products, such as one or more computer programs tangibly embodied in an information carrier (e.g., in a machine readable storage device), or embodied in a propagated signal, for execution by, or to control the operation of, data processing apparatus (e.g., a programmable processor, a computer, or multiple computers). A computer program (also known as a program, software, software application, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file. A program can be stored in a portion of a file that holds other programs or data, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.

The processes and logic flows described in this specification, including the method steps of the subject matter described herein, can be performed by one or more programmable processors executing one or more computer programs to perform functions of the subject matter described herein by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus of the subject matter described herein can be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).

Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. Information carriers suitable for embodying computer program instructions and data include all forms of non volatile memory, including by way of example semiconductor memory devices, (e.g., EPROM, EEPROM, and flash memory devices); magnetic disks, (e.g., internal hard disks or removable disks); magneto optical disks; and optical disks (e.g., CD and DVD disks). The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

To provide for interaction with a user, the subject matter described herein can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, (e.g., a mouse or a trackball), by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well. For example, feedback provided to the user can be any form of sensory feedback, (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form, including acoustic, speech, or tactile input.

The subject matter described herein can be implemented in a computing system that includes a back end component (e.g., a data server), a middleware component (e.g., an application server), or a front end component (e.g., a client computer having a graphical user interface or a web browser through which a user can interact with an implementation of the subject matter described herein), or any combination of such back end, middleware, and front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.

It is to be understood that the disclosed subject matter is not limited in its application to the details of construction and to the arrangements of the components set forth in the following description or illustrated in the drawings. The disclosed subject matter is capable of other embodiments and of being practiced and carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting.

As such, those skilled in the art will appreciate that the conception, upon which this disclosure is based, may readily be utilized as a basis for the designing of other structures, methods, and systems for carrying out the several purposes of the disclosed subject matter. It is important, therefore, that the claims be regarded as including such equivalent constructions insofar as they do not depart from the spirit and scope of the disclosed subject matter.

Although the disclosed subject matter has been described and illustrated in the foregoing exemplary embodiments, it is understood that the present disclosure has been made only by way of example, and that numerous changes in the details of implementation of the disclosed subject matter may be made without departing from the spirit and scope of the disclosed subject matter, which is limited only by the claims which follow.

What is claimed is:
 1. A computerized method of creating an incremental backup of application data by creating a snapshot associated with a current incremental backup of a data file using a change tracking bitmap such that a data file associated with the current incremental backup can be restored from just the snapshot associated with the current incremental backup and an initial backup without needing to access one or more previously generated incremental backups of the data file, each created at an earlier point in time than the point in time for the current incremental backup, the method comprising: receiving, by a computing device, a data file to be monitored by the computing device; identifying, by the computing device, a prior change tracking bitmap associated with the data file, the prior change tracking bitmap comprising data indicative of changes made since a backup created at an earlier point in time than the point in time for the current incremental backup; determining, by the computing device, blocks of data of the data file changed since the prior change tracking bitmap for the prior incremental backup; transmitting, by the computing device, to a backup device blocks of data of the data file changed since the prior change tracking bitmap for the prior incremental backup; and creating, by the computing device, a copy-on-write snapshot of the backup device to capture a point-in-time state of the data file, such that the data file associated with the current incremental backup can be restored from just the snapshot associated with the current incremental backup and the initial backup without needing to access one or more previously generated incremental backups of the data file, each created at an earlier point in time than the point in time for the current incremental backup.
 2. The computerized method of claim 1, wherein the backup device includes data indicative of all changes made for each of a set of backups created at an earlier point in time other than the point in time for the current incremental backup.
 3. The computerized method of claim 2, further comprising transmitting instructions, from a computing device, to a backup application to create a current change tracking bitmap associated with the current incremental backup, the current change tracking bitmap including: a copy of the blocks of data changed since the prior change tracking bitmap, and all the changes from the previous change tracking bitmap, such that the current change tracking bitmap can be used by future backups.
 4. The method of claim 3, further comprising deleting, by the computing device, the prior change tracking bitmap after creating the current change tracking bitmap.
 5. The method of claim 1, further comprising, wherein if the change tracking bitmap does not exist, transmitting instructions to the backup application to copy the entire data file to create an initial backup of the data file and to create an initial change tracking bitmap for tracking changes made to the data file after generation of the initial backup.
 6. The method of claim 1, wherein determining if the data file has a prior change tracking bitmap comprises determining if the prior change tracking bitmap is reliable.
 7. The method of claim 1, wherein receiving, by a change tracking drive, a data file to be monitored further comprises determining if the data file is eligible for change tracking.
 8. The method of claim 1, wherein the data file comprises at least one of a database file and a virtual file.
 9. The method of claim 8, wherein the virtual file comprises at least one of a configuration file and a virtual hard disk file for a virtual machine, facilitating near instant restore and cloning of previously backed up virtual machines.
 10. The method of claim 1, wherein the backup created at an earlier point in time comprises a backup created most recent in time to the current incremental backup.
 11. A non-transitory computer-readable medium storing computer-readable instructions that, when executed, instruct a processor to perform processes comprising: receiving, by a computing device, a database file to be monitored by the computing device; identifying, by the computing device, a prior change tracking bitmap associated with the database file, the prior change tracking bitmap comprising data indicative of changes made since a backup created at an earlier point in time than the point in time for the current incremental backup; determining, by the computing device, blocks of data of the database file changed since the prior change tracking bitmap for the prior incremental backup; transmitting, by the computing device, to a backup device blocks of data of the database file changed since the prior change tracking bitmap for the prior incremental backup; creating, by the computing device, a copy-on-write snapshot of the backup device to capture a point-in-time state of the data file, such that the database file associated with the current incremental backup can be restored from just the snapshot associated with the current incremental backup and the initial backup without needing to access one or more previously generated incremental backups of the database file, each created at an earlier point in time than the point in time for the current incremental backup.
 12. The non-transitory computer-readable medium of claim 11, wherein the backup device includes data indicative of all changes made for each of a set of backups created at an earlier point in time other than the point in time for the current incremental backup.
 13. The non-transitory computer-readable medium of claim 12, further comprising transmitting instructions, from a computing device, to a backup application to create a current change tracking bitmap associated with the current incremental backup, the current change tracking bitmap including: a copy of the blocks of data changed since the prior change tracking bitmap, and all the changes from the previous change tracking bitmap, such that the current change tracking bitmap can be used by future backups.
 14. The non-transitory computer-readable medium of claim 13, further comprising deleting, by the computing device, the prior change tracking bitmap after creating the current change tracking bitmap.
 15. The non-transitory computer-readable medium of claim 11, further comprising, wherein if the change tracking bitmap does not exist, transmitting instructions to the backup application to copy the entire database file to create an initial backup of the database file and to create an initial change tracking bitmap for tracking changes made to the data file after generation of the initial backup.
 16. The non-transitory computer-readable medium of claim 11, wherein determining if the database file has a prior change tracking bitmap comprises determining if the prior change tracking bitmap is reliable.
 17. The non-transitory computer-readable medium of claim 11, wherein receiving, by a change tracking drive, a database file to be monitored further comprises determining if the database file is eligible for change tracking.
 18. A system for creating an incremental backup of application data by creating a snapshot associated with a current incremental backup of a data file using a change tracking bitmap such that a data file associated with the current incremental backup can be restored from just the snapshot associated with the current incremental backup and an initial backup without needing to access one or more previously generated incremental backups of the data file, each created at an earlier point in time than the point in time for the current incremental backup, the system comprising: a memory containing instructions for execution by a processor; the processor configured to: receive a data file to be monitored by the computing device; identify a prior change tracking bitmap associated with the data file, the prior change tracking bitmap comprising data indicative of changes made since a backup created at an earlier point in time than the point in time for the current incremental backup; determine blocks of data of the data file changed since the prior change tracking bitmap for the prior incremental backup; transmit, to a backup device blocks of data of the data file changed since the prior change tracking bitmap for the prior incremental backup; and create a copy-on-write snapshot of the backup device to capture a point-in-time state of the data file, such that the data file associated with the current incremental backup can be restored from just the snapshot associated with the current incremental backup and the initial backup without needing to access one or more previously generated incremental backups of the data file, each created at an earlier point in time than the point in time for the current incremental backup.
 19. The system of claim 18, wherein the backup device includes data indicative of all changes made for each of a set of backups created at an earlier point in time other than the point in time for the current incremental backup.
 20. The system of claim 19, wherein the processor is further configured to transmit instructions to a backup application to create a current change tracking bitmap associated with the current incremental backup, the current change tracking bitmap including: a copy of the blocks of data changed since the prior change tracking bitmap, and all the changes from the previous change tracking bitmap, such that the current change tracking bitmap can be used by future backups.